Report on extensive experiments here.
Europarl parallel corpus (11 languages, common part, release v3)
The common part was extracted by using the English sentences to determine the set of sentences that have a translation in all 11 languages. The extracted data has been checked and cleaned up.
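The extraction step can be illustrated with a minimal sketch. It is not the procedure actually used for this release, which the description does not spell out. It assumes each non-English language is available as a list of (English, foreign) sentence pairs and that English sentences are matched by exact string equality. The function name `extract_common_part` and the `bitexts` structure are hypothetical.

```python
from typing import Dict, List, Tuple


def extract_common_part(
    bitexts: Dict[str, List[Tuple[str, str]]],
) -> Dict[str, List[str]]:
    """Keep only English sentences that have a translation in every language.

    `bitexts` maps a language code (e.g. "de") to English-aligned sentence
    pairs. This input layout is an assumption made for illustration.
    """
    if not bitexts:
        return {"en": []}

    # Per language: English sentence -> translation (first occurrence wins).
    tables: Dict[str, Dict[str, str]] = {}
    for lang, pairs in bitexts.items():
        table: Dict[str, str] = {}
        for english, foreign in pairs:
            table.setdefault(english, foreign)
        tables[lang] = table

    # English sentences that are translated in all languages.
    common = set.intersection(*(set(table) for table in tables.values()))

    # Preserve the order in which sentences appear in the first bitext.
    first_pairs = next(iter(bitexts.values()))
    ordered: List[str] = []
    seen = set()
    for english, _ in first_pairs:
        if english in common and english not in seen:
            seen.add(english)
            ordered.append(english)

    # Emit one line-aligned list of sentences per language.
    corpus: Dict[str, List[str]] = {"en": ordered}
    for lang, table in tables.items():
        corpus[lang] = [table[english] for english in ordered]
    return corpus
```

In a result built this way, line i of every language is a translation of line i of the English side, which is the property that lets the same line counts apply to all 11 languages.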
Training data: 347,614 lines
Development set: 500 lines
Test set: 38,123 lines
References: 1 reference per line in the test set.
Language | Train: 347,614 lines | Dev: 500 lines | Test: 38,123 lines |
---|---|---|---|
Danish(da) | 9,458,365 | 13,981 | 1,040,819 |
German(de) | 9,510,833 | 14,033 | 1,046,557 |
Greek(el) | 9,997,176 | 14,587 | 1,100,255 |
English(en) | 9,945,267 | 14,612 | 1,094,082 |
Spanish(es) | 10,472,178 | 15,398 | 1,151,404 |
Finnish(fi) | 7,179,991 | 10,546 | 789,206 |
French(fr) | 10,955,901 | 16,157 | 1,204,527 |
Italian(it) | 9,880,314 | 14,611 | 1,085,840 |
Dutch(nl) | 10,013,958 | 14,645 | 1,101,028 |
Portuguese(pt) | 10,287,116 | 15,256 | 1,129,898 |
Swedish(sv) | 8,988,906 | 13,243 | 988,588 |