EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

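Overview of Grants-in-Aid for Scientific Research

Grants-in-Aid for Scientific Research are intended to promote creative and pioneering research across a wide range of scientific fields, including the humanities, the social sciences, and the natural sciences. They provide project funding to researchers and research organizations at Japanese universities and at institutions engaged in basic research, particularly in fields that follow advanced research trends. Research results obtained with this funding are widely published in academic journals.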

Kakenhi 15K00317

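This is the Kakenhi project entitled "Improvement of statistical machine translation and example-based machine translation, and release of multilingual grammatical patterns" (Kakenhi C 15K00317). Details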

2016

  1. J. Luo and Y. Lepage. A method of generating translations of unseen n-grams by using proportional analogy. IEEJ Transactions on Electrical and Electronic Engineering, 11(3):325–330, May 2016. DOI:10.1002/tee.22221 [Preparatory work for the research topic of Kakenhi 15K00317]
  2. R. Fam and Y. Lepage. Morphological predictability of unseen words using computational analogy. In Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 51–60, Atlanta, Georgia, October 2016.
  3. V. Kaveeta and Y. Lepage. Solving analogical equations between strings of symbols using neural networks. In Proceedings of the Computational Analogy Workshop at the 24th International Conference on Case-Based Reasoning (ICCBR-16), pages 67–76, Atlanta, Georgia, October 2016.
  4. W. Yang, M. Gao, and Y. Lepage. Production of analogical clusters between marker-based chunks in Chinese and Japanese. In Proceedings of the 10th International Collaboration Symposium on Information, Production and Systems (ISIPS 2016), pages 238–241, IPS, Waseda University, November 2016.

 

2017 

  1. R. Fam, Y. Lepage, S. Gojali, and A. Purwarianti. A study in explaining unseen words in Indonesian using analogical clusters. In Proceedings of the 15th International Conference on Computer Applications (ICCA 2017), pages 416–421, Yangon, Myanmar, January 2017.
  2. W. Yang, H. Shen, and Y. Lepage. Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation. Journal of Information Processing, 25:88–99, 2017. DOI:10.2197/ipsjjip.25.88
  3. R. Fam, Y. Lepage, S. Gojali, and A. Purwarianti. Indonesian unseen words explained by form, morphology and distributional semantics at the same time. In Proceedings of the 23rd Annual Meeting of the Japanese Association for Natural Language Processing, pages 178–181, Tsukuba, Japan, March 2017.
  4. Y. Lepage. Clusters et grilles analogiques : validation par la traduction automatique (invited talk). 40 ans de TA, Grenoble, France, July 2017.
  5. R. Fam and Y. Lepage. A study of the saturation of analogical grids agnostically extracted from texts. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-17), pages 7–16, Trondheim, Norway, June 2017.
  6. Y. Lepage. Character–position arithmetic for analogy questions between word forms. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-17), pages 17–26, Trondheim, Norway, June 2017.
  7. P. Liu and Y. Lepage. Confidence of word forms generated in analogical grids. In Proceedings of the 11th International Collaboration Symposium on Information, Production and Systems (ISIPS 2017), pages 238–240, IPS, Waseda University, November 2017.
  8. F. Rashel, A. Purwarianti, and Y. Lepage. Plausibility of word forms generated from analogical grids on Indonesian. In Proceedings of the 11th International Collaboration Symposium on Information, Production and Systems (ISIPS 2017), pages 245–247, IPS, Waseda University, November 2017.
  9. R. Fam and Y. Lepage. A holistic approach at a morphological inflection task. In Proceedings of the 8th Language & Technology Conference (LTC’17), pages 88–92, Poznań, November 2017. Fundacja Uniwersytetu im. Adama Mickiewicza.
  10. Y. Lepage. Automatic production of quasi-parallel corpora for machine translation (invited talk). In International Conference on Natural Language, Signal and Speech Processing 2017, Casablanca, Morocco, 6–7 December 2017.
 

2018 

  1. Y. Lepage. Analogy for natural language processing and machine translation (invited talk). In Proceedings of the 16th International Conference on Computer Applications (ICCA 2018), Yangon, Myanmar, February 2018.
  2. R. Fam, A. Purwarianti, and Y. Lepage. Plausibility of word forms generated from analogical grids in Indonesian. In Proceedings of the 16th International Conference on Computer Applications (ICCA 2018), pages 179–184, Yangon, Myanmar, February 2018.
  3. R. Fam and Y. Lepage. Validating analogically generated Indonesian words using Fisher’s exact test. In Proceedings of the 24th Annual Meeting of the Japanese Association for Natural Language Processing, pages 312–315, Okayama, Japan, March 2018.
  4. Y. Lepage. Quasi-parallel corpora: Hallucinating translations for the Chinese–Japanese language pair (invited talk). In Proceedings of the 11th Workshop on Building and Using Comparable Corpora (BUCC), co-located with LREC 2018, Miyazaki, Japan, May 2018.
  5. R. Fam and Y. Lepage. Tools for the production of analogical grids and a resource of n-gram analogical grids in 11 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1060–1066, Miyazaki, Japan, May 2018.

Experimental data

Sentences: Europarl

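Europarl parallel corpus (11 languages, common part, release version 3)

The common part was extracted using the English sentences, so that every line is guaranteed to have a corresponding translation in all 11 languages. The extracted data has been checked and cleaned.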
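The extraction of a common part can be pictured with the following minimal sketch. It assumes that each language pair is available as a mapping from English sentence to translation; the function name `common_part` and the toy data are hypothetical and are not taken from the scripts actually used to build the data set.

```python
from typing import Dict, List


def common_part(pairs: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Keep only the English sentences that have a translation in every language.

    `pairs` maps a language code to a dict {English sentence: translation}.
    Returns line-aligned lists, one per language, plus one for "en".
    """
    # English sentences present in every language pair.
    shared = set.intersection(*(set(table) for table in pairs.values()))
    ordered = sorted(shared)  # fixed order so that all lists stay aligned
    corpus = {"en": ordered}
    for lang, table in pairs.items():
        corpus[lang] = [table[sentence] for sentence in ordered]
    return corpus


# Toy data (hypothetical): only "Thank you." is translated in both languages.
toy = {
    "fr": {"Thank you.": "Merci.", "The sitting is closed.": "La séance est levée."},
    "de": {"Thank you.": "Danke.", "I agree.": "Ich stimme zu."},
}
aligned = common_part(toy)
print(aligned)
```

Sorting the shared sentences is only one way to fix a common order; keeping the order of the English side of one of the pairs would work equally well.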

Number of lines

Training set: 347,614 lines
Development set: 500 lines
Test set: 38,123 lines
Reference: reference translations of the test set

Number of words in each part for all languages

Language         Train (347,614 lines)  Dev (500 lines)  Test (38,123 lines)
Danish (da)                  9,458,365           13,981            1,040,819
German (de)                  9,510,833           14,033            1,046,557
Greek (el)                   9,997,176           14,587            1,100,255
English (en)                 9,945,267           14,612            1,094,082
Spanish (es)                10,472,178           15,398            1,151,404
Finnish (fi)                 7,179,991           10,546              789,206
French (fr)                 10,955,901           16,157            1,204,527
Italian (it)                 9,880,314           14,611            1,085,840
Dutch (nl)                  10,013,958           14,645            1,101,028
Portuguese (pt)             10,287,116           15,256            1,129,898
Swedish (sv)                 8,988,906           13,243              988,588

 

Words: SIGMORPHON data set

Released data: SIGMORPHON 2016 - Analogies (download)

We extract analogy questions from this data by taking all analogies of form, filtered by morphological features. For each analogy, four different analogy questions can be asked, with each of the four terms in turn becoming the answer.

Caution: This is not the task proposed in the SIGMORPHON Shared Task, which is a machine learning task: predicting a word form from a lemma and morphological features after learning from the training data.
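The way one analogy gives rise to four analogy questions can be sketched as follows. This is only an illustration: the function name `analogy_questions` and the `?` placeholder for the hidden term are assumptions, not part of the released data or tools.

```python
from typing import List, Tuple

Analogy = Tuple[str, str, str, str]


def analogy_questions(analogy: Analogy) -> List[Tuple[str, str]]:
    """Return four (question, answer) pairs, hiding each term in turn."""
    questions = []
    for hidden in range(4):
        terms = list(analogy)
        answer = terms[hidden]
        terms[hidden] = "?"  # hypothetical placeholder for the unknown term
        a, b, c, d = terms
        questions.append((f"{a} : {b} :: {c} : {d}", answer))
    return questions


example = ("yūnāniyyun", "al-yūnāniyyatayni", "muʿāṣirun", "al-muʿāṣiratayni")
questions = analogy_questions(example)
for question, answer in questions:
    print(question, "->", answer)
```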

Format:

The file contains one analogy per line.

A : B :: C : D

The lines introduced by a # give the name of the language of the analogies that follow.

# arabic/

yūnāniyyun : al-yūnāniyyatayni :: muʿāṣirun : al-muʿāṣiratayni

al-muʿāṣiratayni : al-yūnāniyyatayni :: muʿāṣirun : yūnāniyyun

al-yūnāniyyatayni : yūnāniyyun :: al-muʿāṣiratayni : muʿāṣirun

...
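A file in this format could be read along the following lines. This is a minimal sketch in Python 3: the function name `read_analogies` is hypothetical, and it assumes that terms are separated exactly by " : " and " :: " as in the sample above and that the trailing "/" of a language header can be dropped.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Analogy = Tuple[str, str, str, str]


def read_analogies(lines: Iterable[str]) -> Dict[str, List[Analogy]]:
    """Group analogies by language, following the format described above."""
    by_language: Dict[str, List[Analogy]] = {}
    language: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            # Header line such as "# arabic/": keep the bare language name.
            language = line.lstrip("#").strip().rstrip("/")
            by_language.setdefault(language, [])
            continue
        if language is None:
            raise ValueError("analogy found before any language header")
        left, right = line.split(" :: ")
        a, b = left.split(" : ")
        c, d = right.split(" : ")
        by_language[language].append((a, b, c, d))
    return by_language


sample = """# arabic/
yūnāniyyun : al-yūnāniyyatayni :: muʿāṣirun : al-muʿāṣiratayni
al-muʿāṣiratayni : al-yūnāniyyatayni :: muʿāṣirun : yūnāniyyun
"""
parsed = read_analogies(sample.splitlines())
print(parsed)
```

For the real file, the same function can be applied to an open file object, since it only needs an iterable of lines.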

Original data: SIGMORPHON 2016 Shared Task: Morphological Reinflection Task 1 Track 1 (10 languages)

Publications:

If you are using this data, please cite our publication.

Y. Lepage. Character-Position Arithmetic for Analogy Questions between Word Forms. In Proceedings of the Computational Analogy Workshop at the 25th International Conference on Case-Based Reasoning (ICCBR-CAW-17), pages 17–26, Trondheim, Norway, June 2017.

 

Experimental settings

Word-to-word alignment tools
  • GIZA++ (Och and Ney, 2003)
  • Anymalign (Lardilleux and Lepage, 2009)
Translation table generation
GIZA++/Moses or Anymalign.
Experiments with statistical machine translation
  • training and decoding: Moses (Koehn et al., 2007),
  • tuning: MERT (Och, 2003),
  • language models: SRILM (Stolcke, 2002)
Experiments with the example-based approach
Not an open tool; the engine is being developed at the EBMT/NLP lab.

Experimental results

 

Resources: N-gram analogical clusters and grids

A resource of n-gram analogical clusters and grids extracted from the first thousand corresponding lines of the Europarl corpus v3. Please refer to our paper for further details.

Languages: Danish (da), German (de), Greek (el), English (en), Spanish (es), Finnish (fi), French (fr), Italian (it), Dutch (nl), Portuguese (pt), and Swedish (sv)

N-grams: 1-gram to 6-gram

Version: 1.0.0

Release Date: 2018-05-07

Last Updated: 2018-05-07

Download Links:

License

If you are using this data, please cite our publication.

R. Fam and Y. Lepage. Tools for the production of analogical grids and a resource of n-gram analogical grids in 11 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1060–1066, Miyazaki, Japan, May 2018. ELRA. PDF

All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


Nlg Module

A Python 2 module for analogy. Please refer to the README file and our paper for installation, usage, etc.

  • Words2Vectors: Building a vector representation for words
  • Words2Clusters: Extraction of analogical clusters from a given set of words
  • Words2Grids: Construction of analogical grids from a given set of words
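As a rough illustration of the idea behind grouping words into analogical clusters, and not the Nlg module's own code or API, the sketch below assumes that each word is represented by its character counts and groups word pairs that share the same count difference. The names `char_vector`, `difference` and `clusters` are hypothetical, and the code is written for Python 3, whereas the module targets Python 2. An equal count difference is only a necessary condition for an analogy of form, so the groups below are candidates rather than verified clusters.

```python
from collections import Counter, defaultdict
from itertools import permutations
from typing import Dict, List, Tuple

Key = Tuple[Tuple[str, int], ...]


def char_vector(word: str) -> Counter:
    """Represent a word by the number of occurrences of each character."""
    return Counter(word)


def difference(a: str, b: str) -> Key:
    """Signed character-count difference between a and b, as a hashable key."""
    va, vb = char_vector(a), char_vector(b)
    diff = {ch: va[ch] - vb[ch] for ch in set(va) | set(vb)}
    return tuple(sorted((ch, n) for ch, n in diff.items() if n != 0))


def clusters(words: List[str]) -> Dict[Key, List[Tuple[str, str]]]:
    """Group ordered word pairs that share the same count difference."""
    groups = defaultdict(list)
    for a, b in permutations(words, 2):
        groups[difference(a, b)].append((a, b))
    # Keep only groups with at least two pairs: candidates for a : b :: c : d.
    return {key: pairs for key, pairs in groups.items() if len(pairs) > 1}


result = clusters(["walk", "walked", "talk", "talked"])
for key, pairs in result.items():
    print(key, pairs)
```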

Version: 1.0.0

Release Date: 2018-05-07

Last Updated: 2018-05-07

Download Link: Nlg module

Dependencies:

 

License

If you use our module, please cite our paper:

R. Fam and Y. Lepage. Tools for the production of analogical grids and a resource of n-gram analogical grids in 11 languages. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), pages 1060–1066, Miyazaki, Japan, May 2018. ELRA. PDF

All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


 

Questions / Contact

For questions regarding these resources, please contact Rashel Fam at fam(dot)rashel@fuji.waseda.jp.

Contact

EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287