EBMT / NLP Laboratory

Graduate School of Information, Production and Systems, Waseda University

Korean L2 Unknown Words

Labeled Dataset

Description: A dataset of L2 learners (annotators) and Korean words that have been labeled as known/unknown.  Suitable for training unknown word models.  See our paper and the download's readme for details.

Version: 1.1.0

Release Date: 2017-09-27

Last Updated: 2018-05-06

Download Link:  labeled_dataset_v1_1_0.zip

Annotated Corpus

Description: The original annotation data, published as a standoff annotated corpus.  This file is useful if your study requires the context of the annotated words, or if you need to process the annotations differently from how we processed them to produce the labeled dataset.  Using this file requires a little more work, however, because the original corpus must be downloaded separately and then merged with these standoff annotations.  The download includes a Python script that performs this merge automatically.  See the download's readme for details.  Further documentation can be found in our paper.

Version: 1.0.0

Release Date: 2017-10-10

Last Updated: 2017-10-10

Download Link:  annotated_corpus_v1_0_0.zip
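
To illustrate what merging standoff annotations with a separately downloaded text involves, here is a minimal Python sketch. It is not the script shipped in the download. The record layout (character offsets plus a label) and the names `Annotation` and `merge_standoff` are assumptions made for this example only; the readme in the download defines the actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    """Hypothetical standoff record: a character span plus a label."""
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset into the source text
    label: str  # e.g. "known" or "unknown"


def merge_standoff(text: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Attach each annotation to the span of text it points at.

    Returns (surface string, label) pairs in document order.
    """
    merged = []
    for ann in sorted(annotations, key=lambda a: a.start):
        # Reject spans that do not fit the text, which usually means the
        # annotations were made against a different version of the corpus.
        if not (0 <= ann.start < ann.end <= len(text)):
            raise ValueError(f"span {ann.start}-{ann.end} is outside the text")
        merged.append((text[ann.start:ann.end], ann.label))
    return merged


if __name__ == "__main__":
    source = "나는 학교에 갑니다"  # stand-in for one original corpus text
    spans = [Annotation(3, 5, "unknown"), Annotation(0, 1, "known")]
    for surface, label in merge_standoff(source, spans):
        print(surface, label)
```

Because standoff annotations store only positions, the offset check matters: if the source text differs from the one that was annotated, the spans will silently point at the wrong words unless they are validated.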

TOPIK Base Word Frequency Stats

Description: These are the word frequencies and document frequencies extracted from the TOPIK graded corpus and used to compute some of the word features for the SVM models described in our paper.

Version: 1.0.0

Release Date: 2018-05-06

Last Updated: 2018-05-06

Download Link:  TOPIK_baseword_stats_v1_0_0.zip
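
For readers unfamiliar with the two statistics named above, the following Python sketch shows how word frequency and document frequency differ. It is an illustration under stated assumptions, not the procedure used to produce this file: the function name `count_frequencies` is hypothetical, and the input is assumed to be documents that have already been split into base words.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_frequencies(documents: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Count word frequency and document frequency over tokenized documents.

    Word frequency is the total number of occurrences of a word.
    Document frequency is the number of documents containing it at least once.
    """
    word_freq: Counter = Counter()
    doc_freq: Counter = Counter()
    for tokens in documents:
        word_freq.update(tokens)      # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts at most once
    return word_freq, doc_freq


if __name__ == "__main__":
    # Toy documents, assumed to be already reduced to base words.
    docs = [["학교", "가다", "학교"], ["가다", "먹다"]]
    word_freq, doc_freq = count_frequencies(docs)
    for word in word_freq:
        print(word, word_freq[word], doc_freq[word])
```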


All resources on this page are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License*. If you use our data, please cite our paper:

Kevin P. Yancey, Yves Lepage. Korean L2 Vocabulary Prediction: Can a Large Annotated Corpus be Used to Train Better Models for Predicting Unknown Words? In Proceedings of the 11th edition of the Language Resources and Evaluation Conference, pages 438-445, Miyazaki, Japan, May 2018.


* Note: This license does not cover the original TOPIK exam texts, which may be downloaded separately from the TOPIK website. Permission to use the text of the original TOPIK exams in published research may be obtained by contacting the National Institute for International Education (NIIED).  See the readme included with the Annotated Corpus download for further details.

Questions / Contact

For questions regarding these resources, please contact Kevin P. Yancey at kpyancey@fuji.waseda.jp.


EBMT / NLP Laboratory

Graduate School of Information, Production and Systems

Waseda University

2-7 Hibikino, Wakamatsu-ku,
Kitakyushu-shi, Fukuoka-ken, 808-0135, Japan

Tel/Fax: +81-93-692-5287