Grants-in-Aid for Scientific Research (Kakenhi): overview
This grant program aims to promote creative and pioneering research across a broad range of scientific fields, including the humanities, the social sciences, and the natural sciences. It provides project funding to researchers and organizations at Japanese universities and research institutions engaged in basic research, in particular in fields related to advanced research trends. Research results obtained under this funding are widely published in academic journals.
Kakenhi 23500187
This project is the Kakenhi project entitled "Improvement of statistical machine translation and example-based machine translation, and release of multilingual grammatical patterns" (Kiban C 23500187). Details
Objectives
Improve the generation of translation tables, mainly with the sampling-based sub-sentential alignment method (Anymalign), in the following directions:
- generation of word-to-word translation tables
- generation of translation tables with longer n-gram entries
- generation of rule translation tables
Publications
- J. Luo, J. Sun, and Y. Lepage. Improving sampling-based alignment method for statistical machine translation tasks. In Proceedings of the 17th Japanese National Conference in Natural Language Processing, pages 186–189, Toyohashi, March 2011.
- J. Luo, A. Lardilleux, and Y. Lepage. Exploring n-grams distribution for sampling-based alignment. In Proceedings of the 5th Language and Technology Conference (LTC'11), pages 289–293, Poznan, Poland, November 2011.
- J. Luo, A. Lardilleux, and Y. Lepage. Improving sampling-based alignment by investigating the distribution of n-grams in phrase translation tables. In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25), pages 150–159, Singapore, December 2011.
- J. Lee and Y. Lepage. Fast production of ad hoc translation tables using the sampling-based method. In Proceedings of the 18th Japanese National Conference in Natural Language Processing, pages 809–812, Hiroshima, March 2012.
- J. Luo, J. Sun, and Y. Lepage. Producing translation tables by separate N-grams subtables. In Proceedings of the 18th Japanese National Conference in Natural Language Processing, pages 797–800, Hiroshima, March 2012.
- J. Luo, A. Lardilleux, and Y. Lepage. Improving the distribution of N-grams in phrase tables obtained by the sampling-based method. Lecture Notes in Artificial Intelligence, ??:??–??, 2013.
- J. Luo and Y. Lepage. An investigation of the sampling-based alignment method and its contributions. International Journal of Artificial Intelligence & Applications (IJAIA), 4(4):9–19, July 2013.
- J. Luo and Y. Lepage. A comparison of association and estimation approaches to alignment in word-to-word translation. In Proceedings of the Tenth International Symposium on Natural Language Processing (SNLP 2013), pages 181–186, Phuket, Thailand, October 2013.
- T. Kimura, J. Matsuoka, Y. Nishikawa, and Y. Lepage. Analogy-based machine translation for longer sentences. In Proceedings of the 7th International Collaboration Symposium (IPS-ICS), IPS, Waseda University, November 2013.
- J. Luo, A. Max, and Y. Lepage. Using the productivity of language is rewarding for small data: Populating SMT phrase table by analogy. In Z. Vetulani, editor, Proceedings of the 6th Language & Technology Conference (LTC'13), pages 147–151, Poznan, December 2013. Fundacja Uniwersytetu im. Adama Mickiewicza.
- T. Kimura, Y. Nishikawa, J. Matsuoka, and Y. Lepage. Generation of translation tables adequate for example-based machine translation by analogy. In Proceedings of the 2014 International Conference on Artificial Intelligence and Software Engineering (AISE2014), pages ??–??, Phuket, Thailand, January 2014. DESTech Publications.
- S. Zhang, J. Luo, and Y. Lepage. Improving N-gram distribution for sampling-based alignment by extraction of longer N-grams. In Proceedings of the 215th Research Meeting in Natural Language Processing of the Information Processing Society of Japan, Tokyo, Japan, February 2014.
- T. Kimura, Y. Nishikawa, J. Matsuoka, and Y. Lepage. Analogy-based machine translation using secability. In Proceedings of the 2014 International Conference on Computational Science and Computational Intelligence (CSCI'2014), pages ??–??, Las Vegas, Nevada, USA, March 2014. IEEE Computer Society's Conference Publishing Services.
- T. Kimura, J. Matsuoka, Y. Nishikawa, and Y. Lepage. Generation and assessment of translation tables for example-based machine translation by analogy (in Japanese). In Proceedings of the 16th Meeting of the Information Processing Society of Japan, pages ??–??, Tokyo, March 2014.
- T. Kimura, Y. Nishikawa, J. Matsuoka, and Y. Lepage. The influence of sentence length in example-based machine translation by analogy (in Japanese). In Proceedings of the 20th Yearly Conference of the Japanese Association for Natural Language Processing, pages ??–??, Sapporo, March 2014.
- Y. Nishikawa, T. Kimura, J. Matsuoka, and Y. Lepage. A study of analogy-based machine translation using monolingual or bilingual segmentation (in Japanese). In Proceedings of the 20th Yearly Conference of the Japanese Association for Natural Language Processing, pages ??–??, Sapporo, March 2014.
Experimental data
Europarl parallel corpus (11 languages, common part, release v3)
The common part was extracted by using the English sentences to determine the set of sentences that have a translation in all 11 languages. The extracted data have been checked and cleaned up.
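The extraction of the common part can be sketched as follows. This is a minimal illustration, not the actual extraction script: it assumes each monolingual side is indexed by a shared sentence ID, whereas the real data are aligned through the English side.

```python
# Sketch: keep only the sentences that have a translation in all languages.
# Assumes each corpus is a dict mapping a shared sentence ID to its text.

def common_part(corpora):
    """corpora: {lang: {sent_id: sentence}} -> same shape, common IDs only."""
    # IDs present in every language
    shared = set.intersection(*(set(sents) for sents in corpora.values()))
    return {lang: {i: sents[i] for i in sorted(shared)}
            for lang, sents in corpora.items()}

# Toy data: sentence 3 is missing in French, so it is dropped everywhere.
corpora = {
    "en": {1: "Hello .", 2: "Thank you .", 3: "Good morning ."},
    "fr": {1: "Bonjour .", 2: "Merci ."},
    "de": {1: "Hallo .", 2: "Danke .", 3: "Guten Morgen ."},
}
print(sorted(common_part(corpora)["en"]))  # → [1, 2]
```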
Number of lines
- Training data
- 347,614 lines
- Development set
- 500 lines
- Test set
- 38,123 lines
- References
- 1 reference per line in the test set.
Number of words in all sets in all languages
| Language | Train: 347,614 lines | Dev: 500 lines | Test: 38,123 lines |
|---|---:|---:|---:|
| Danish (da) | 9,458,365 | 13,981 | 1,040,819 |
| German (de) | 9,510,833 | 14,033 | 1,046,557 |
| Greek (el) | 9,997,176 | 14,587 | 1,100,255 |
| English (en) | 9,945,267 | 14,612 | 1,094,082 |
| Spanish (es) | 10,472,178 | 15,398 | 1,151,404 |
| Finnish (fi) | 7,179,991 | 10,546 | 789,206 |
| French (fr) | 10,955,901 | 16,157 | 1,204,527 |
| Italian (it) | 9,880,314 | 14,611 | 1,085,840 |
| Dutch (nl) | 10,013,958 | 14,645 | 1,101,028 |
| Portuguese (pt) | 10,287,116 | 15,256 | 1,129,898 |
| Swedish (sv) | 8,988,906 | 13,243 | 988,588 |
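A quick sanity check on these figures is the average sentence length in words per line, computed directly from the table (the interpretation of the contrast is a general observation, not a result of the project):

```python
# Average words per line in the training data, from the table above.
train_lines = 347_614
train_words = {"da": 9_458_365, "fi": 7_179_991, "fr": 10_955_901}

for lang, words in train_words.items():
    print(lang, round(words / train_lines, 1))
# Finnish has markedly fewer words per line than French, as expected
# from its agglutinative morphology.
```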
Experimental settings
- Word-to-word alignment tools
- GIZA++ (Och and Ney, 2003)
- Anymalign (Lardilleux and Lepage, 2009)
- Translation table generation
- GIZA++/Moses or Anymalign.
- Experiments with statistical machine translation
- training and decoding: Moses (Koehn et al., 2007),
- tuning: MERT (Och, 2003),
- language models: SRILM (Stolcke, 2002)
- Experiments with the example-based approach
- not an open tool; the engine is being developed in-house at the EBMT/NLP laboratory.
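For the language models, SRILM's `ngram-count` tool was used; a typical invocation can be assembled as below. The flags shown are standard SRILM options, but the file names and the n-gram order are illustrative, not the exact settings of the experiments.

```python
# Build a typical SRILM ngram-count invocation.  Dry run: the command is
# only constructed here, not executed.  File names are placeholders.

def srilm_command(train_file, lm_file, order=3):
    return ["ngram-count",
            "-order", str(order),               # n-gram order
            "-text", train_file,                # tokenized training corpus
            "-lm", lm_file,                     # output LM in ARPA format
            "-interpolate", "-kndiscount"]      # modified Kneser-Ney smoothing

cmd = srilm_command("train.en", "train.en.lm", order=5)
print(" ".join(cmd))
```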
Results of experiments with different translation tables (TTs)
Experiments with Moses decoder
Baseline experiments
- (tt_gen=giza++/moses, tt_type=phrases, tt_gen_option=w/o pruning, translation_engine=moses) GIZA++/MOSES standard pipeline, without pruning
- (tt_gen=giza++/moses, tt_type=phrases, option=with pruning, translation_engine=moses) GIZA++/MOSES standard pipeline with pruning
- (tt_gen=anymalign, tt_type=phrases, translation_engine=moses) same as above but the translation tables are the ones output by Anymalign with standard options
- (tt_gen=giza++/moses+anymalign, tt_type=phrases, translation_engine=moses, translation_option=merged_tables) Merged tables and Moses decoder for translation
- (tt_gen=giza++/moses+anymalign, tt_type=phrases, translation_engine=moses, translation_option=multiple_tables) Multiple phrase table with Moses decoder; experiments on some language pairs only
- (tt_gen=anymalign, tt_type=phrases, ttgen_option=Giza++/Moses_nbr_of_entries, translation_engine=moses) Anymalign forced to output the same number of entries in each n-gram x m-gram cell as in the TTs output by GIZA++/Moses
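The "same number of entries" constraint amounts to truncating each n-gram x m-gram cell of the Anymalign table to the size of the corresponding cell in the reference table. A sketch, with a simplified entry format (real translation tables carry several scores per entry, not one):

```python
from collections import defaultdict

# Sketch: keep at most as many entries per (n, m) cell as a reference
# (e.g. GIZA++/Moses) table has.  Entries are (source, target, score).

def cap_per_cell(entries, reference_counts):
    cells = defaultdict(list)
    for src, tgt, score in entries:
        cells[(len(src.split()), len(tgt.split()))].append((src, tgt, score))
    capped = []
    for cell, items in cells.items():
        items.sort(key=lambda e: e[2], reverse=True)      # best-scored first
        capped.extend(items[:reference_counts.get(cell, len(items))])
    return capped

table = [("the", "le", 0.9), ("the", "la", 0.4), ("the", "les", 0.2),
         ("thank you", "merci", 0.8)]
# Cap the 1x1 cell at 2 entries; other cells are left unconstrained.
print(cap_per_cell(table, {(1, 1): 2}))
```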
Improvements by allotting different amounts of time to the generation of N-gram x M-gram entries
- (tt_gen=anymalign, tt_type=phrases, ttgen_option=equal_time_distribution, translation_engine=moses) Anymalign with an equal distribution of time for each n-gram x m-gram cell
- (tt_gen=anymalign, tt_type=phrases, ttgen_option=univariate_time_distribution, translation_engine=moses) Anymalign with standard normal distribution of time for n-gram x m-gram cells (= univariate time distribution)
- (tt_gen=anymalign, tt_type=phrases, ttgen_option=multivariate_time_distribution, translation_engine=moses) Anymalign with multivariate time distribution over n-gram x m-gram cells
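The time-allocation schemes above can be illustrated as weightings over the (n, m) cells: uniform for the equal distribution, Gaussian-shaped otherwise. The center and width below are illustrative values only, not the parameters used in the experiments.

```python
import math

# Sketch: distribute a total sampling-time budget over n-gram x m-gram
# cells, either uniformly or following a bivariate Gaussian weighting
# that concentrates time around a chosen center cell.

def time_allocation(budget, max_n, max_m, scheme="uniform",
                    center=(2.0, 2.0), sigma=1.0):
    cells = [(n, m) for n in range(1, max_n + 1) for m in range(1, max_m + 1)]
    if scheme == "uniform":
        weights = {c: 1.0 for c in cells}
    else:  # "gaussian": more time near the center cell
        cn, cm = center
        weights = {(n, m): math.exp(-((n - cn) ** 2 + (m - cm) ** 2)
                                    / (2 * sigma ** 2)) for n, m in cells}
    total = sum(weights.values())
    return {c: budget * w / total for c, w in weights.items()}

alloc = time_allocation(3600, 4, 4, scheme="gaussian")
# The cell at the center gets more time than a distant cell:
print(round(alloc[(2, 2)], 1), ">", round(alloc[(4, 4)], 1))
```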
Rule tables:
- (tt_gen=giza++/moses, tt_type=rules, translation_engine=moses) GIZA++/MOSES
- (tt_gen=anymalign, tt_type=rules, tt_gen_option=discontiguous_alignments, translation_engine=moses) Anymalign discontiguous entries filtered to generate rules, i.e., 0 to 2 placeholders
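Filtering discontiguous entries into rules boils down to counting gap placeholders and keeping entries with 0 to 2 of them. In the sketch below the placeholder token "X" is an assumption for illustration, not Anymalign's actual gap notation.

```python
# Sketch: keep only discontiguous entries usable as rules, i.e. entries
# with 0 to 2 gap placeholders, equal in number on both sides.
# The placeholder token "X" is illustrative.

def is_rule(source, target, max_gaps=2, gap="X"):
    s_gaps = source.split().count(gap)
    t_gaps = target.split().count(gap)
    return s_gaps == t_gaps and s_gaps <= max_gaps

entries = [("give X a chance", "donner une chance à X"),   # 1 gap: kept
           ("X the X of X", "X le X de X"),                # 3 gaps: rejected
           ("thank you", "merci")]                         # 0 gaps: kept
rules = [e for e in entries if is_rule(*e)]
print(len(rules))  # → 2
```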
Experiments with an in-house analogy-based EBMT engine
Baseline experiments
- (tt_gen=giza++/moses, tt_type=phrases, option=with pruning, translation_engine=ebmt) Translation tables generated by the standard use of the SMT pipeline GIZA++/Moses
- (tt_gen=anymalign, tt_type=phrases, translation_engine=ebmt) Translation tables generated by Anymalign (3 hours)
Improvements by use of better suited translation tables
- (tt_gen=secability, tt_type=phrases, translation_engine=ebmt) Word-to-word alignment using Anymalign and phrases output by secability in each language and phrase-to-phrase alignment generated by in-house alignment
- (tt_gen=lseq, tt_type=phrases, translation_engine=ebmt) Word-to-word alignment using Anymalign and phrases output by secability in the source language only and phrase-to-phrase alignment generated by in-house alignment
- (tt_gen=cutnalign, tt_type=phrases, translation_engine=ebmt) Word-to-word alignment using Anymalign and phrases output by cutnalign (simultaneous bilingual segmentation and alignment)
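The in-house engine rests on proportional analogy between strings (solving A : B :: C : D for D). A drastically simplified solver, handling only prefix/suffix commutations, gives the flavor; the actual engine handles far more general analogies.

```python
# Very simplified proportional-analogy solver: solves A : B :: C : D
# for D when the analogy is a prefix/suffix commutation,
# e.g. walk : walked :: talk : talked.

def solve_analogy(a, b, c):
    for i in range(len(a) + 1):
        prefix, suffix = a[:i], a[i:]
        # a = prefix + suffix; look for c = v + suffix and b = prefix + rest,
        # in which case d = v + rest.
        if c.endswith(suffix) and b.startswith(prefix):
            v = c[:len(c) - len(suffix)]
            return v + b[i:]
    return None  # no prefix/suffix analogy found

print(solve_analogy("walk", "walked", "talk"))   # → talked
print(solve_analogy("vois", "voyons", "crois"))  # → croyons
```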