EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Large-scale AMR graph dataset of more than 2 million sentences in the computational linguistics / natural language processing domain.

Download the data (.txt.gz files):

  1. a dataset of 2,643,682 sentences from ACL ARC with AMR graphs (very large file, 809 MB; the download may take a while)
  2. a dataset of 40,842 examples of sentences with AMR graphs and re-generated sentences (8.5 MB)
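
Because the larger file is several hundred megabytes compressed, it may be more practical to stream it than to decompress it fully first. The following is a minimal sketch in Python, not the authors' tooling. The internal layout of the files is not described on this page, so the sketch assumes that entries are separated by blank lines, as is common for AMR corpora; check a few lines of the actual file before relying on this. The function names are illustrative.

```python
import gzip
from typing import Iterable, Iterator


def iter_blocks(lines: Iterable[str]) -> Iterator[str]:
    """Yield blank-line-separated blocks from an iterable of text lines.

    ASSUMPTION: one entry (sentence plus AMR graph) per block, with blocks
    separated by one or more blank lines. The real files may differ.
    """
    current = []
    for line in lines:
        if line.strip():
            # Non-blank line: part of the current block.
            current.append(line.rstrip("\n"))
        elif current:
            # Blank line closes the current block, if there is one.
            yield "\n".join(current)
            current = []
    if current:
        # Last block when the input does not end with a blank line.
        yield "\n".join(current)


def iter_gz_blocks(path: str) -> Iterator[str]:
    """Stream blocks from a gzip-compressed text file, one block at a time."""
    with gzip.open(path, mode="rt", encoding="utf-8") as stream:
        yield from iter_blocks(stream)
```

For example, `next(iter_gz_blocks("data.txt.gz"))` would return the first block of a downloaded file, where `data.txt.gz` is a placeholder for whatever name the file was saved under. Reading in text mode through `gzip.open` keeps memory use roughly proportional to one block rather than to the whole file.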

For details about the creation of the data, refer to the following paper:

M. Zhao, Y. Wang, and Y. Lepage. Large-scale AMR corpus with re-generated sentences: domain adaptive pre-training on ACL Anthology Corpus. In Proceedings of the 14th International Conference on Advanced Computer Science and Information Systems (ICACSIS 2022), pages ??–??, 2022.

If you use the data, please cite the above paper.


EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287