


Reuters RCV1 RCV2 Multilingual, Multiview Text Categorization Test Collection Data Set


Life Classification


Data structure: 159M

    Massih-Reza Amini
    Université Joseph Fourier
    Laboratoire d'Informatique de Grenoble
    Email : Massih-Reza.Amini '@'

    Cyril Goutte
    National Research Council Canada
    Interactive Language Technology group
    Email : Cyril.Goutte '@'

    Data Set Information:

    Uncompressing rcv1rcv2aminigoutte.tar.bz2 creates a directory containing 5 subdirectories, EN, FR, GR, IT and SP, one per language. Each subdirectory contains 5 files, each holding the indexes of the documents written in or translated into that language. For example, EN contains the files:
    - Index_EN-EN : Original English documents
    - Index_FR-EN : French documents translated to English
    - Index_GR-EN : German documents translated to English
    - Index_IT-EN : Italian documents translated to English
    - Index_SP-EN : Spanish documents translated to English

    And similarly for the 4 other languages.
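    The layout described above can be enumerated programmatically. The following is a minimal sketch, assuming the Index_<source>-<target> naming shown for EN carries over unchanged to the other 4 subdirectories; the function name index_files and the root path "data_root" are hypothetical placeholders, not part of the data set.

```python
from pathlib import Path

# Language codes used for the 5 subdirectories.
LANGS = ["EN", "FR", "GR", "IT", "SP"]

def index_files(root):
    """Map (source_lang, target_lang) to the expected index file path.

    Assumes every subdirectory follows the Index_<source>-<target>
    pattern shown for EN; `root` is wherever the archive was unpacked.
    """
    root = Path(root)
    return {
        (src, tgt): root / tgt / f"Index_{src}-{tgt}"
        for tgt in LANGS
        for src in LANGS
    }

# "data_root" is a placeholder for the unpacked directory.
files = index_files("data_root")
```

    Keying the mapping by (source, target) keeps original documents (source equal to target) and their translations addressable in the same way.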

    Each file contains one indexed document per line, in a format similar to SVM_light.  Each line is of the form:
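    As a minimal sketch of reading one document, assuming the usual SVM_light layout of a leading label followed by whitespace-separated feature:value pairs (the exact layout used in these files is an assumption here, and parse_line and the sample line are made up for illustration):

```python
def parse_line(line):
    """Split 'label idx:val idx:val ...' into (label, {idx: val}).

    The label is kept as a string so that both category names and
    numeric labels are handled the same way.
    """
    tokens = line.split()
    label = tokens[0]
    features = {}
    for tok in tokens[1:]:
        idx, _, val = tok.partition(":")
        features[int(idx)] = float(val)
    return label, features

# Made-up sample line, not taken from the data set.
label, feats = parse_line("C15 3:0.5 17:1.25")
```

    A sparse dictionary is used because each document only touches a small part of the vocabulary.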

    Attribute Information:

    We focused on six relatively populous categories: C15, CCAT, E21, ECAT, GCAT and M11. For each language and each class, we sampled up to 5000 documents from RCV1 (for English) or RCV2 (for the other languages). Documents belonging to more than one of our 6 classes were assigned the label of their smallest class. This resulted in 12-30K documents per language and 11-34K documents per class. The distribution of documents over languages is:

                 Number of                   Vocabulary
    Language      documents     percentage       size
    ************  **********   ************  ************
    English        18,758         16.78        21,531
    French         26,648         23.45        24,893
    German         29,953         26.80        34,279
    Italian        24,039         21.51        15,506
    Spanish        12,342         11.46        11,547
    Total         111,740

    The distribution of classes in the whole collection is:
              Number of                
    Class      documents     percentage  
    *********  **********   ************
    C15          18,816         16.84    
    CCAT         21,426         19.17    
    E21          13,701         12.26        
    ECAT         19,198         17.18        
    GCAT         19,178         17.16        
    M11          19,421         17.39
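
    The single-label rule described above (a document in several of the 6 classes receives the label of its smallest class) can be sketched as follows. This is only an illustration: the class counts from the table above are used as a stand-in for the sizes the authors actually compared, and assign_label is a hypothetical helper.

```python
# Class counts from the table above, used here as stand-in sizes
# (an assumption; the sizes used by the authors may differ).
CLASS_SIZES = {
    "C15": 18816,
    "CCAT": 21426,
    "E21": 13701,
    "ECAT": 19198,
    "GCAT": 19178,
    "M11": 19421,
}

def assign_label(doc_classes, sizes=CLASS_SIZES):
    """Return the smallest of the 6 classes the document belongs to.

    Classes outside the 6 retained categories are ignored; None is
    returned if the document belongs to none of them.
    """
    candidates = [c for c in doc_classes if c in sizes]
    if not candidates:
        return None
    return min(candidates, key=sizes.__getitem__)
```

    For example, under these stand-in sizes a document tagged with both CCAT and C15 would be labeled C15.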

    In the experiments reported in Amini, Usunier and Goutte (2009), each document available in a given language was treated as the observed view of an example, and its translations into the other languages, generated using Machine Translation, were used as the other views of that example. The reported results were averaged over 10 random samples of 10 labeled examples per view for training, with 20% of the collection used for testing.
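
    One possible reading of this sampling protocol is sketched below. It is not the authors' code: the helper make_splits, the input structure (view name mapped to the ids of documents whose observed view it is), the choice to draw the test set first, and the fixed seed are all assumptions made for illustration.

```python
import random

def make_splits(doc_ids_by_view, n_runs=10, n_labeled=10,
                test_frac=0.2, seed=0):
    """Build n_runs random (train, test) splits.

    For each run, test_frac of all documents is held out for testing,
    then n_labeled training examples are drawn per view from the rest.
    Document ids are assumed to be unique across views.
    """
    rng = random.Random(seed)
    all_ids = sorted({d for ids in doc_ids_by_view.values() for d in ids})
    splits = []
    for _ in range(n_runs):
        shuffled = all_ids[:]
        rng.shuffle(shuffled)
        n_test = int(round(test_frac * len(shuffled)))
        test = set(shuffled[:n_test])
        train = {}
        for view, ids in doc_ids_by_view.items():
            pool = [d for d in ids if d not in test]
            train[view] = rng.sample(pool, min(n_labeled, len(pool)))
        splits.append((train, test))
    return splits
```

    A metric would then be computed on each of the returned splits and averaged over the runs.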

    Relevant Papers:

    Massih-Reza Amini, Nicolas Usunier and Cyril Goutte. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. Advances in Neural Information Processing Systems 22, pp. 28-36, 2009

    Massih-Reza Amini and Cyril Goutte. A Co-classification Approach to Learning from Multilingual Corpora. Machine Learning Journal Springer, 79(1-2):105-121, 2010

    Abhishek Kumar, Hal Daumé III. A co-training approach for multi-view spectral clustering. International Conference on Machine Learning, pp. 393-400, 2011

    Citation Request:

    If you publish results based on this data set, please acknowledge its use, by referring to:

    M.-R. Amini, N. Usunier, C. Goutte. Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization. Advances in Neural Information Processing Systems 22, pp. 28-36, 2009



