This project is a Kakenhi Project (Kakenhi Kiban C 21K12038) entitled "Theoretically founded algorithms for the automatic production of analogy test sets in NLP."
Background: Breakthroughs in NLP have produced vector representations for words and sentences through methods such as word2vec, GloVe, BERT, GPT, and XLM-R. These models place words or sentence parts in a shared vector space and are usually evaluated extrinsically, through benchmarks such as GLUE and SuperGLUE, primarily in English. Word embedding models are far lighter to train than sentence embedding models, which, owing to their resource requirements, are mostly trained by major institutions. Intrinsic evaluation methods include analogies, regarded as indicators of the quality of an embedding space; they are tested with sets such as the Google analogy test set and BATS, which exist for several languages but are resource-intensive to build.
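For readers unfamiliar with analogy-based intrinsic evaluation, the sketch below shows the standard vector-offset (3CosAdd) procedure used with test sets such as the Google analogy test set. The toy vectors and the function name are purely illustrative, not part of the project's code:

```python
import numpy as np

def solve_analogy(vocab, a, b, c):
    """Return the word d that best completes a : b :: c : d
    by the vector-offset (3CosAdd) rule: d ~ b - a + c."""
    target = vocab[b] - vocab[a] + vocab[c]
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    # Exclude the three query words, as analogy test sets usually do.
    candidates = {w: v for w, v in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

# Toy embedding space (hypothetical vectors, for illustration only).
vocab = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([0.2, 0.0, 1.0]),
    "queen": np.array([0.2, 1.0, 1.0]),
}
print(solve_analogy(vocab, "man", "woman", "king"))  # → queen
```

An embedding space scores well on such a test when the offset b - a lands close to d - c for many attested analogies, which is why analogies serve as indicators of the quality of the space.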
Scientific and technical questions: The goal of the research project is to equip NLP researchers with tools to explore vector representations of words or (parts of) sentences, so as to conduct intrinsic evaluation of these representations. The practical result of the project will be the release of tools that automatically extract all analogies between the objects in a given vector space, or in a portion of it. These tools will let researchers explore an entire vector space of words or sentences, or a large part of it, and will enable the automatic production of analogy test sets: human verification of automatically produced candidate test sets will then become possible. The tools will lift two restrictions: they will apply to any vector space in any language, without restriction, and they will retrieve any kind of analogy when exploring the entire space, with a better balance between formal and semantic analogies.
Core of research plan: The research project will address the following key scientific question: given a set of vector representations of objects, obtained from distributional semantic methods or from formal representations of words or sentences, how can all analogies be extracted from it? The project will explore ways to extend a method developed in a previous project (Kakenhi Kiban C 15K00317) to real-valued vectors. This is not a trivial question: for the algorithm to work on real numbers, a proper definition of a relaxation of arithmetic analogies between numbers is needed. To that end, theoretical work on the algebraic and analytic properties of analogy will be conducted. The risk is that the resulting methods lead to a large increase in computation time. The research project will also explore methods to cast integer-valued string edit distances into real-valued vector representations.
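As a sketch of what a relaxed arithmetic analogy between real-valued vectors could look like, the tolerance-based test below replaces the exact parallelogram condition a - b = c - d with an approximate one. This is our illustration of the idea, not the project's definition:

```python
import numpy as np

def is_arithmetic_analogy(a, b, c, d, tol=0.1):
    """Relaxed parallelogram test for a : b :: c : d on real-valued
    vectors: accept when the two offsets differ by at most tol."""
    return bool(np.linalg.norm((a - b) - (c - d)) <= tol)

# Toy 2-dimensional vectors standing in for word embeddings.
king, man = np.array([0.9, 0.8]), np.array([0.5, 0.7])
queen, woman = np.array([0.9, 0.3]), np.array([0.5, 0.2])
print(is_arithmetic_analogy(king, man, queen, woman))  # → True
```

The choice of tolerance directly affects how many candidate analogies an exhaustive search over the space would return, which is one reason the computation-time risk mentioned above is real.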
References:
[1] X. Deng and Y. Lepage. Resolution of analogies between strings in the case of multiple solutions. In Proceedings of the ICCBR Workshop on Analogies: from Theory to Applications (ATA@ICCBR 2023), CEUR Workshop Proceedings, pages 3–14, July 2023.
[2] M. Eget, X. Yang, and Y. Lepage. A study in the generation of multilingually aligned middle sentences. In Z. Vetulani and P. Paroubek, editors, Proceedings of the 10th Language & Technology Conference (LTC 2023) – Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 45–49, April 2023.
[3] M. T. Eget, X. Yang, H. Xiao, and Y. Lepage. A study in the generation of multilingually aligned middle sentences. In Proceedings of the 16th International Collaboration Symposium on Information, Production and Systems (ISIPS 2022), pages C2–1 (7678), IPS, Waseda University, November 2022.
[4] R. Fam and Y. Lepage. A study of analogical density in various corpora at various granularity. Information, 12(8), 17 pages, August 2021.
[5] R. Fam and Y. Lepage. Organising lexica into analogical grids: a study of a holistic approach for morphological generation under various sizes of data in various languages. Journal of Experimental & Theoretical Artificial Intelligence, 36(1):1–26, 2022.
[6] R. Fam and Y. Lepage. Investigating parallelograms: Assessing several word embedding spaces against various analogy test sets in several languages using approximation. In Z. Vetulani and P. Paroubek, editors, Proceedings of the 10th Language & Technology Conference (LTC 2023) – Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 68–72, April 2023.
[7] R. Fam and Y. Lepage. Investigating parallelograms inside word embedding space using various analogy test sets in various languages. In Proceedings of the 29th Annual Meeting of the Japanese Association for Natural Language Processing, pages 718–722, Naha, Japan, March 2023.
[8] R. Fam and Y. Lepage. A study of universal morphological analysis using morpheme-based, holistic, and neural approaches under various data size conditions. Annals of Mathematics and Artificial Intelligence, ??(??):??–??, 2024.
[9] R. Hou, H. Liu, and Y. Lepage. Enhancing low-resource neural machine translation by using case-based reasoning. In Proceedings of the 17th International Collaboration Symposium on Information, Production and Systems (ISIPS 2023), pages 25–29, IPS, Waseda University, November 2023.
[10] Y. Lepage. Formulae for the solution of an analogical equation between Booleans using the Sheffer stroke (NAND) or the Peirce arrow (NOR). In M. Couceiro, P.-A. Murena, and S. Afantenos, editors, Proceedings of the Workshop Interactions between analogies and machine learning, colocated with IJCAI 2023 (IARML@IJCAI 2023), pages 3–14, August 2023.
[11] Y. Lepage and M. Couceiro. Analogie et moyenne généralisée. In Actes de la conférence Journées d'intelligence artificielle françaises – Plateforme française d'intelligence artificielle (PFIA-JIAF 2024), pages ??–??, La Rochelle, France, July 2024.
[12] Y. Mei, R. Fam, and Y. Lepage. Extraction and comparison of analogical cluster sizes in different languages for different vocabulary sizes. In Proceedings of the 15th International Collaboration Symposium on Information, Production and Systems (ISIPS 2021), pages A1–6, IPS, Waseda University, November 2021.
[13] L. Wang, Z. Pang, H. Wang, X. Zhao, and Y. Lepage. Solving sentence analogies by using embedding spaces combined with a vector-to-sequence decoder or by fine-tuning pre-trained language models. In Z. Vetulani and P. Paroubek, editors, Proceedings of the 10th Language & Technology Conference (LTC 2023) – Human Language Technologies as a Challenge for Computer Science and Linguistics, pages 325–330, April 2023.
[14] L. Wang, H. Wang, and Y. Lepage. Continued pre-training on sentence analogies for translation with small data. In Proceedings of the 14th International Conference on Language Resources and Evaluation (LREC 2024) and the 30th International Conference on Computational Linguistics (COLING'24), pages ??–??, Turin, Italy, May 2024.
[15] B. Yan, H. Wang, L. Wang, Y. Zhou, and Y. Lepage. Transformer-based hierarchical attention models for solving analogy puzzles between longer, lexically richer and semantically more diverse sentences. In M. Couceiro, P.-A. Murena, and S. Afantenos, editors, Proceedings of the Workshop Interactions between analogies and machine learning, colocated with IJCAI 2024 (IARML@IJCAI 2024), pages ??–??, August 2024.
[16] Q. Zhang and Y. Lepage. Improving sentence embedding with sentence relationships from word analogies. In Proceedings of the ICCBR Workshop on Analogies: from Theory to Applications (ATA@ICCBR 2023), CEUR Workshop Proceedings, pages 43–53, July 2023.
Invited talks
[1] Analogy and language data, LORIA Colloquium, LORIA, 15 November 2023.
[2] Analogy, explanation of language data, and recent work on vector representations of sentences and analogy, Workshop Analogies: From Learning to Explainability, Arras, 27–28 November 2023.
[3] Analogy and means: general considerations and application to strings, Forum on Cognitive Sciences and Natural Language Processing, Nancy, 29 November 2023.
[4] Analogy test sets for NLP, MALOTEC seminar, LORIA, 13 December 2023.
We provide below three data sets of analogies between sentences.
A resource of more than 22,000 semantico-formal analogies between sentences, extracted from the English part of the Tatoeba corpus by exploiting word analogies from the Google analogy test set.
Languages: English (en)
Type of data: Analogies between sentences
Format: Each line in the file is formatted in the following way:
sentence 1 \t sentence 2 \t sentence 3 \t sentence 4
where sentence 1 : sentence 2 :: sentence 3 : sentence 4 is a sentence analogy.
Examples:
Bamako is a superb city. Mali is a wonderful country. Bangkok is a superb city. Thailand is a wonderful country.
When did you get to Zagreb? When did you arrive in Croatia? When did you get to Bern? When did you arrive in Switzerland?
He was greatly respected, while his son was as much despised. She was greatly respected, while her daughter was as much despised. He received great respect, and his son also received contempt. She received great respect, and her daughter also received contempt.
He woke his son up for the fajr prayer. She woke her daughter up for the fajr prayer. He woke up his son and began to pray. She woke up her daughter and began to pray.
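A minimal sketch of how the tab-separated format described above can be read into quadruples (the function name is hypothetical, not part of a released tool):

```python
def parse_analogy_lines(lines):
    """Parse lines of the form:
    sentence1 \t sentence2 \t sentence3 \t sentence4
    and return them as (s1, s2, s3, s4) tuples, skipping malformed lines."""
    analogies = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:
            analogies.append(tuple(fields))
    return analogies

sample = ("When did you get to Zagreb?\tWhen did you arrive in Croatia?\t"
          "When did you get to Bern?\tWhen did you arrive in Switzerland?\n")
quads = parse_analogy_lines([sample])
print(quads[0][3])  # → When did you arrive in Switzerland?
```

Each returned tuple (s1, s2, s3, s4) is to be read as the sentence analogy s1 : s2 :: s3 : s4.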
Version: 1.0.0
Release Date: 2023-05-02
Last Updated: 2023-05-02
Download Link:
If you use this data, please cite our publication.
B. Yan, H. Wang, L. Wang, Y. Zhou, and Y. Lepage. Transformer-based hierarchical attention models for solving analogy puzzles between longer, lexically richer and semantically more diverse sentences. In M. Couceiro, P.-A. Murena, and S. Afantenos, editors, Proceedings of the Workshop Interactions between analogies and machine learning, colocated with IJCAI 2024 (IARML@IJCAI 2024), pages ??–??, August 2024. Accepted, to appear.
If you use this Python package, please cite our publication.
Y. Lepage and M. Couceiro. Towards a unified framework of numerical analogies: Open questions and perspectives. In M. Couceiro, P.-A. Murena, and S. Afantenos, editors, Proceedings of the Workshop Interactions between analogies and machine learning, colocated with IJCAI 2024 (IARML@IJCAI 2024), pages ??–??, August 2024. Accepted, to appear.
Download Link:
License: No commercial use allowed. For research purposes only.
See the NLG package under this site: Projects > Kakenhi 15K00317 > Tools -- Nlg package