This project is a Kakenhi Project (Kakenhi Kiban C 18K11447) entitled "Self-explainable and fast-to-train example-based machine translation using neural networks."
Background: In contrast to the rule-based approach to machine translation (MT), the data-oriented approach relies on aligned data in two languages, called bilingual corpora. There have been three trends in the data-oriented approach to MT: example-based (EBMT), statistical (SMT), and neural (NMT). SMT and NMT are eager-learning methods: a long time is required to train models from bilingual corpora. EBMT is a lazy-learning method: less time is spent extracting knowledge in advance, but typically more time is spent during translation.
Scientific and technical questions: Google, Baidu, etc. have developed general-purpose machine translation systems using the eager-learning data-oriented approaches. Thanks to recent advances in neural networks, Google, Baidu, etc. and the translation industry have already moved, or are currently moving, to NMT. Neural networks require long training times and specialised hardware (GPUs). The translation industry and translation services in large companies [Pyne, 2017] develop customized MT systems based on in-house translation memories because the style and the domain of the translated documents are specific. Such systems are integrated with translation memory software. Fast domain adaptation, which individual professional translators in the translation industry need, is difficult with neural networks because training a new system on a new domain requires time and resources. Neural networks are also black boxes. Errors (typically repeated words or untranslated words) are difficult to explain because the weights in a neural network are difficult to interpret. Why and how pieces of sentences are translated cannot be easily traced. The AI community is aware of the general issue. Explainable artificial intelligence (XAI) [Gunning, 2017] has been proposed (see also [Sekine, 2017]).
Core of research plan: This research project will address two issues in the data-oriented approach to MT: long, resource-greedy training and the black-box issue. It will combine techniques from the three data-oriented approaches to MT: from NMT, the use of deep learning techniques for natural language processing, namely continuous representations of words (word embeddings, distributional semantics); from SMT, the use of soft sub-sentential alignments [Dyer et al., 2012], [20]; from EBMT, the use of traces in recursive translation by analogy so as to explain why and how pieces of sentences are translated. The key research question is how to adapt the example-based framework of machine translation by analogy to soft alignments of continuous representations of words.
References: [Dyer et al., 2012] A simple, fast, and effective reparameterization of IBM model 2, NAACL. [Gunning, 2017] Explainable Artificial Intelligence, web page. [Pyne, 2017] Introduction of MT into industrial-scale translation workflows with translator acceptance, MT Summit XVI. [Sumita & Iida, 1992] Experiments and prospects of example-based machine translation, ACL. [Sato, 1991] Example-based machine translation, PhD thesis. [Sekine, 2017] 理化学研究所、言語情報アクセス研究チーム (RIKEN, Language Information Access Research Team), web page. [Takezawa et al., 2003] Toward a broad coverage bilingual corpus for speech translation of travel conversation in the real world, LREC.
We provide below three data sets of analogies between sentences.
A resource of 5,607 semantico-formal analogies between sentences extracted from the English part of the Tatoeba corpus, together with 5,296 analogies extracted in the same way from the French part of Tatoeba. Please refer to our papers for further details.
Languages: English (en) and French (fr)
Type of data: Analogies between short sentences
Format: Each line in the file is formatted in the following way:
sentence 1 : sentence 2 :: sentence 3 : x => x = sentence 4
meaning that sentence 4 has been obtained by solving the analogical equation sentence 1 : sentence 2 :: sentence 3 : x, where sentences 1, 2 and 3 are from the Tatoeba corpus. To ensure that sentence 4 is valid, the file contains only those analogies where sentence 4 was found in the Tatoeba corpus.
Examples:
There 's hardly any coffee left in the pot . : There 's almost no coffee left in the pot . :: There 's hardly any water in the bucket . : x  =>  x = There 's almost no water in the bucket .
My watch loses five minutes a day . : My watch loses ten minutes a day . :: My watch loses two minutes a day . : x  =>  x = My watch loses three minutes a day .
I do not want to fight you . : I do not want to see you . :: I do not want to overreact . : x  =>  x = I do not want to know .
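As an illustration of the format above, here is a minimal Python sketch that splits one line into its four sentences. The names `Analogy` and `parse_analogy_line` are hypothetical and are not part of the released reader program. The sketch assumes that the separators " : ", " :: " and "=>" do not occur inside the sentences themselves, which may not hold for every line of the files.

```python
from typing import NamedTuple


class Analogy(NamedTuple):
    """Hypothetical container for one line of the resource."""
    sentence1: str
    sentence2: str
    sentence3: str
    sentence4: str


def parse_analogy_line(line: str) -> Analogy:
    """Parse 'sentence 1 : sentence 2 :: sentence 3 : x => x = sentence 4'.

    Assumption: ' : ', ' :: ' and '=>' never appear inside a sentence.
    """
    equation, arrow, solution = line.rstrip("\n").partition("=>")
    if not arrow:
        raise ValueError(f"missing '=>' in line: {line!r}")

    # Split the equation into its two ratios, then each ratio into two terms.
    left, double_colon, right = equation.partition(" :: ")
    sentence1, colon1, sentence2 = left.partition(" : ")
    sentence3, colon2, unknown = right.rpartition(" : ")
    if not (double_colon and colon1 and colon2) or unknown.strip() != "x":
        raise ValueError(f"unexpected equation format: {line!r}")

    # The right-hand side of '=>' has the form 'x = sentence 4'.
    sentence4 = solution.strip()
    if not sentence4.startswith("x ="):
        raise ValueError(f"unexpected solution format: {line!r}")
    sentence4 = sentence4[len("x ="):].strip()

    return Analogy(sentence1.strip(), sentence2.strip(),
                   sentence3.strip(), sentence4)


example = ("My watch loses five minutes a day . : "
           "My watch loses ten minutes a day . :: "
           "My watch loses two minutes a day . : x  =>  "
           "x = My watch loses three minutes a day .")
analogy = parse_analogy_line(example)
print(analogy.sentence4)
```

Splitting on " :: " before " : " keeps the two ratios apart, so that the single-colon split is applied to each side separately.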
Version: 1.0.0
Release Date: 2019-04-18
Last Updated: 2019-04-18
Download Link:
If you are using this data, please cite our publication.
Y. Lepage. Semantico-formal resolution of analogies between sentences. In Proceedings of the 9th Language and Technology Conference (LTC 2019), pages 57–61, Poznan, Poland, May 2019. [PDF]
Y. Lepage. Analogies between short sentences: a semantico-formal approach. In LNAI Series, Post-LTC volume, Springer, to appear.
A resource of 5,607 analogies between short English sentences. The sentences were obtained as results of a neural-based sentence analogy solver on 20 % of the 5,607 semantico-formal analogies of the first resource, while the remaining 80 % semantico-formal analogies were used to train the neural model (60 % for training set) and to stop learning (20 % for development set). This was done 5 times to get all 5,607 sentences. Please refer to our paper for further details.
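The 60 % / 20 % / 20 % rotation described above can be sketched as a five-fold split. This is only an illustration: the function name `five_fold_splits`, the absence of shuffling, and the choice of the next fold as development set are assumptions, not details taken from the paper.

```python
from typing import Iterator, List, Tuple


def five_fold_splits(
    n_items: int, n_folds: int = 5
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (train, dev, test) index lists, one triple per fold.

    Assumption: fold k is the test set, fold k+1 (cyclically) is the
    development set, and the remaining three folds form the training set.
    """
    # Deal the indices into n_folds interleaved folds (no shuffling here).
    folds = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds
        test = folds[k]
        dev = folds[dev_k]
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, dev_k) for i in fold]
        yield train, dev, test


for train, dev, test in five_fold_splits(5607):
    print(len(train), len(dev), len(test))
```

Over the five rounds, each analogy falls in the test set exactly once, which is how one output sentence can be produced for every one of the 5,607 analogies.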
Language: English (en)
Type of data: Analogies between short sentences converted to lowercase
Format: Each line has the following format:
sentence 1 : sentence 2 :: sentence 3 : x => x = sentence 4 (hypothesis) \t sentence 5 (reference)
where sentence 4 is the solution generated by our neural model for the analogical equation consisting of three known sentences (sentences 1, 2 and 3). Sentence 5 is the reference answer to the identical equation in the released set, i.e. the sentence in the first resource, obtained by semantico-formal resolution of the analogy.
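As a sketch of how such lines might be used, the following Python code extracts the hypothesis (sentence 4) and the reference (sentence 5) from each line and computes the proportion of exact matches. The function names are hypothetical, exact match is only one possible measure and is not necessarily the one used in the paper, and the code assumes that \t in the format stands for a real tab character. The sample lines use placeholder letters instead of sentences and are not taken from the data.

```python
from typing import Iterable, Tuple


def split_hypothesis_reference(line: str) -> Tuple[str, str]:
    """Return (hypothesis, reference) for one line of the resource.

    Assumption: the analogy and the reference are separated by a tab.
    """
    analogy_part, tab, reference = line.rstrip("\n").partition("\t")
    if not tab:
        raise ValueError(f"missing tab separator in line: {line!r}")
    _, arrow, solution = analogy_part.partition("=>")
    hypothesis = solution.strip()
    if not arrow or not hypothesis.startswith("x ="):
        raise ValueError(f"unexpected analogy format: {line!r}")
    return hypothesis[len("x ="):].strip(), reference.strip()


def exact_match_rate(lines: Iterable[str]) -> float:
    """Proportion of lines whose hypothesis equals the reference."""
    pairs = [split_hypothesis_reference(l) for l in lines if l.strip()]
    if not pairs:
        return 0.0
    return sum(hyp == ref for hyp, ref in pairs) / len(pairs)


# Placeholder lines (letters stand for sentences; not real data).
sample_lines = [
    "a : b :: c : x => x = d \td",
    "a : b :: c : x => x = e \td",
]
print(exact_match_rate(sample_lines))
```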
Examples:
Version: 1.0.0
Release Date: 2021-04-16
Last Updated: 2021-04-16
Download Link:
License: All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. If you are using this data, please cite our publication.
L. Wang and Y. Lepage. Vector-to-sequence models for sentence analogies. In IEEE, editor, Proceedings of the 2020 International Conference on Advanced Computer Science and Information Systems (ICACSIS 2020), pages 441–446, October 2020. [PDF] (Session best presentation award and conference Best paper award)
A resource of 1,309,844 analogies between sentences for the purpose of example-based machine translation (EBMT). The translations were obtained as results of neural-based EBMT on 20 % of the 1,309,844 bilingual analogies, while the remaining 80 % bilingual analogies were used to train the neural model (60 % for training set) and to stop learning (20 % for development set). This was done 5 times to get all 1,309,844 translations. Please refer to our paper for further details.
Language: English (en), French (fr)
Type of data: Analogies between sentences in EBMT (bilingual set of analogies between sentences)
Format: Each line consists of two analogies and one sentence, in the following format:
A : B :: C : D \t A'_hyp (hypothesis) : B' :: C' : D' \t A'_ref (reference)
Sentences A, B, C and D form an analogy in the source language (en), where A is the input sentence to be translated. A'_hyp is the translation of A, generated by our neural model by solving the analogy formed with the three target-language (fr) sentences B', C' and D', which correspond to B, C and D in a parallel aligned corpus. A'_ref is the reference translation of A.
Examples:
it 's snowing . : it is snowing . :: tom 's sweating . : tom is sweating .  \t il neige . : il neige . :: tom transpire . : tom transpire .  \t  il neige .
we 're punctual . : you 're punctual . :: we 're too weak . : you 're too weak .  \t nous sommes ponctuels . : tu es ponctuelle . :: nous sommes trop faibles . : tu es trop faible . \t nous sommes ponctuelles .
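The bilingual format can be read along the following lines. This is a minimal Python sketch with hypothetical function names, not the released reader program. It assumes that the three fields are separated by tab characters (a literal backslash followed by t is also accepted, in case the files contain it as shown above) and that " : " and " :: " do not occur inside sentences.

```python
from typing import Dict, Tuple

Quad = Tuple[str, str, str, str]


def parse_analogy(text: str) -> Quad:
    """Split 'A : B :: C : D' into its four sentences."""
    left, double_colon, right = text.partition(" :: ")
    a, colon1, b = left.partition(" : ")
    c, colon2, d = right.partition(" : ")
    if not (double_colon and colon1 and colon2):
        raise ValueError(f"unexpected analogy format: {text!r}")
    return (a.strip(), b.strip(), c.strip(), d.strip())


def parse_bilingual_line(line: str) -> Dict[str, object]:
    """Parse 'A : B :: C : D \\t A'_hyp : B' :: C' : D' \\t A'_ref'."""
    # Accept both a real tab and the two-character sequence backslash + t.
    normalized = line.rstrip("\n").replace("\\t", "\t")
    fields = [field.strip() for field in normalized.split("\t")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    source = parse_analogy(fields[0])
    target = parse_analogy(fields[1])
    return {
        "source": source,          # (A, B, C, D) in English
        "target": target,          # (A'_hyp, B', C', D') in French
        "hypothesis": target[0],   # A'_hyp, produced by the model
        "reference": fields[2],    # A'_ref
    }


example = ("we 're punctual . : you 're punctual . :: "
           "we 're too weak . : you 're too weak .  \t "
           "nous sommes ponctuels . : tu es ponctuelle . :: "
           "nous sommes trop faibles . : tu es trop faible . \t "
           "nous sommes ponctuelles .")
record = parse_bilingual_line(example)
print(record["hypothesis"], "|", record["reference"])
```

In this example the hypothesis and the reference differ only in grammatical gender, which an exact string comparison would count as a mismatch.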
Version: 1.0.0
Release Date: 2021-06-17
Last Updated: 2021-06-17
Download Link:
License: All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. If you are using this data, please cite our publication.
V. Taillandier, L. Wang et Y. Lepage. Réseaux de neurones pour la résolution d’analogies entre phrases en traduction automatique par l’exemple. In Actes de la 6e conférence conjointe Journées d’Etudes sur la Parole (JEP, 31e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL, 22e édition), volume 2 : Traitement Automatique des Langues Naturelles, pages 108–121. AFCP et ATALA, Mai 2020. [PDF] (Prix du meilleur article TALN -- Best paper award)
We provide a simple Python program to read in the data. Type
on a command line to get help.
Download Link:
Three resources in English. Sentences with sentences from the same corpus that cover them.