Europarl: A Parallel Corpus for Statistical Machine Translation (versions in 21 European languages)


README.md

For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005.

Please cite the paper if you use this corpus in your work. See also the extended (but earlier) version of the report.

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. For this purpose we extracted matching items and labeled them with corresponding document IDs. Using a preprocessor, we identified sentence boundaries. We then sentence-aligned the data using a tool based on the Gale and Church algorithm.
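To make the idea of length-based sentence alignment concrete, here is a minimal sketch in Python. It is not the aligner shipped with the corpus and it does not use the published probabilistic cost: the function name `align_by_length`, the bead penalties, and the plain absolute-difference cost are all assumptions chosen for illustration. It only shows the general shape of the approach, a dynamic program that pairs sentences whose character lengths are similar.

```python
def align_by_length(src, tgt):
    """Pair sentences by character length (simplified, illustrative only).

    Returns a list of (source_indices, target_indices) beads.
    """
    # Allowed bead shapes (sentences consumed on each side) and a fixed
    # penalty for each. These penalties are made up for this sketch.
    beads = {(1, 1): 0, (2, 1): 10, (1, 2): 10, (1, 0): 50, (0, 1): 50}
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in beads.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                # Simplified cost: length mismatch plus the bead penalty.
                c = cost[i][j] + abs(len_s - len_t) + penalty
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen beads.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        result.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    result.reverse()
    return result
```

A real aligner would replace the absolute difference with a statistical model of length ratios, but the search structure stays the same.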

Release v7

On 15 May 2012 we released a further expanded and improved version of the corpus. Previous versions are also available. The corpus is released as a source release with the document files and a sentence aligner, and as parallel corpora of language pairs that include English.

Changes since v6

added 01/2011 - 11/2011 data, now up to around 60 million words per language
further refined preprocessing, cleaning

All formats contain document (<CHAPTER>), speaker (<SPEAKER>), and paragraph (<P>) mark-up on a separate line. The data is stored in one file per day, and in smaller units for newer data.

Some documents have the SPEAKER tag attribute LANGUAGE which indicates what language the original speaker was using.
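As an illustration, the following Python sketch reads the attributes of a SPEAKER line, including the optional LANGUAGE attribute. The exact tag layout used in the example (`<SPEAKER ID=2 LANGUAGE="DE" NAME="...">`) is an assumption, as are the names `ATTR_RE` and `parse_speaker_tag`; check a few lines of the actual files before relying on it.

```python
import re

# Matches KEY=value pairs where the value is either double-quoted
# or a bare token without whitespace. The tag layout is assumed.
ATTR_RE = re.compile(r'(\w+)=("[^"]*"|[^\s>]+)')


def parse_speaker_tag(line):
    """Return the attributes of a SPEAKER line as a dict, or None."""
    line = line.strip()
    if not line.upper().startswith("<SPEAKER"):
        return None
    attrs = {}
    for key, value in ATTR_RE.findall(line):
        attrs[key.upper()] = value.strip('"')
    return attrs


# Hypothetical example line, not copied from the corpus.
tag = parse_speaker_tag('<SPEAKER ID=2 LANGUAGE="DE" NAME="Example Speaker">')
original_language = tag.get("LANGUAGE") if tag else None
```

Because the attribute is only present on some documents, `tag.get("LANGUAGE")` returns `None` when the original language is not recorded.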

To use the parallel corpora with tools like GIZA++, you need to:

tokenize the text (required)
lowercase the text (recommended)
strip empty lines and their correspondences (required)
remove lines with XML tags (starting with "<") (required)
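The steps above can be sketched in Python as follows. This is a rough illustration rather than the recommended toolchain: the regex tokenizer is a stand-in for a proper tokenizer, and the names `tokenize` and `clean_parallel` are hypothetical. The input is assumed to be two lists of lines that are already aligned one to one.

```python
import re

# Crude stand-in tokenizer: runs of word characters, or single
# punctuation marks. A real pipeline would use a proper tokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    return " ".join(TOKEN_RE.findall(text))


def clean_parallel(src_lines, tgt_lines, lowercase=True):
    """Return cleaned (source, target) pairs from two aligned line lists."""
    cleaned = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Drop a pair if either side is empty, so both sides stay in sync.
        if not src or not tgt:
            continue
        # Drop mark-up lines, which start with "<".
        if src.startswith("<") or tgt.startswith("<"):
            continue
        src, tgt = tokenize(src), tokenize(tgt)
        if lowercase:
            src, tgt = src.lower(), tgt.lower()
        cleaned.append((src, tgt))
    return cleaned
```

Filtering pairs rather than each side independently is the important design choice: removing a line from only one side would shift every later sentence out of alignment.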

Size of the Corpus

Sizes for single-language data after removing XML.

Language	Sentences	Words
Bulgarian	411,636	-
Czech	668,595	13,195,311
Danish	2,323,099	47,761,381
German	2,176,537	47,236,849
Greek	1,517,141	-
English	2,218,201	53,974,751
Spanish	2,123,835	54,806,927
Estonian	692,210	11,358,009
Finnish	2,119,515	33,708,706
French	2,190,579	54,202,850
Hungarian	658,824	12,606,986
Italian	2,081,669	50,259,169
Lithuanian	678,665	11,512,131
Latvian	666,026	12,085,228
Dutch	2,333,816	53,487,257
Polish	387,490	7,087,016
Portuguese	2,121,889	52,300,149
Romanian	402,904	9,663,544
Slovak	674,359	13,116,301
Slovene	634,488	12,665,974
Swedish	2,241,386	45,665,947

Sizes for parallel corpora after sentence aligning and removing XML.

Parallel Corpus (L1-L2)	Sentences	L1 Words	English Words
Bulgarian-English	406,934	-	9,886,291
Czech-English	646,605	12,999,455	15,625,264
Danish-English	1,968,800	44,654,417	48,574,988
German-English	1,920,209	44,548,491	47,818,827
Greek-English	1,235,976	-	31,929,703
Spanish-English	1,965,734	51,575,748	49,093,806
Estonian-English	651,746	11,214,221	15,685,733
Finnish-English	1,924,942	32,266,343	47,460,063
French-English	2,007,723	51,388,643	50,196,035
Hungarian-English	624,934	12,420,276	15,096,358
Italian-English	1,909,115	47,402,927	49,666,692
Lithuanian-English	635,146	11,294,690	15,341,983
Latvian-English	637,599	11,928,716	15,411,980
Dutch-English	1,997,775	50,602,994	49,469,373
Polish-English	632,565	12,815,544	15,268,824
Portuguese-English	1,960,407	49,147,826	49,216,896
Romanian-English	399,375	9,628,010	9,710,331
Slovak-English	640,715	12,942,434	15,442,233
Slovene-English	623,490	12,525,644	15,021,497
Swedish-English	1,862,234	41,508,712	45,703,795

Test Sets

Several test sets have been released for the Europarl corpus. In general, the Q4/2000 portion of the data (2000-10 to 2000-12) should be reserved for testing. All released test sets have been selected from this quarter. The shared tasks for the 2006 and 2007 ACL Workshops on Statistical Machine Translation provide test sets from the Europarl corpus.
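One way to reserve the Q4/2000 portion is to split the per-day files by the date in their names. The sketch below assumes file names of the form `ep-YY-MM-DD...` (for example `ep-00-10-25.txt`); that naming pattern and the function names are assumptions, so adjust the regular expression to match the files you actually have.

```python
import re

# Assumed naming scheme: ep-<2-digit year>-<month>-<day>, e.g. ep-00-10-25.txt
DATE_RE = re.compile(r"ep-(\d{2})-(\d{2})-(\d{2})")


def is_q4_2000(filename):
    """True if the file name carries a date from 2000-10 to 2000-12."""
    match = DATE_RE.search(filename)
    if match is None:
        return False
    year, month = match.group(1), int(match.group(2))
    return year == "00" and 10 <= month <= 12


def split_train_test(filenames):
    """Split file names into (train, test), holding out Q4/2000 for testing."""
    train, test = [], []
    for name in filenames:
        (test if is_q4_2000(name) else train).append(name)
    return train, test
```

Files whose names do not match the assumed pattern fall into the training list, so it is worth checking the test list is not empty after the split.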

The original common test set from the Koehn/Och/Marcu ACL 2003 paper is available in the archives.

Extended versions of these test sets are available in the evaluation matrix of the EuroMatrix project.

Known Bugs

Some special HTML entities and noisy characters are not removed from the data.
Some recent Greek data has only parts of transcripts in the files.

Terms of Use

We are not aware of any copyright restrictions on the material. If you use this data in your research, please contact pkoehn@inf.ed.ac.uk. Please let us know if you find problems with the data or if you want the data for other language pairs. We recommend using the last quarter of 2000 for testing (2000-10 until 2000-12) for consistency in reporting research results on this data.

Acknowledgments

The work was supported in part by the EuroMatrixPlus project funded by the European Commission (7th Framework Programme).
