

Subset of the RCV1 dataset. This corpus has already been used in author identification experiments.




Data Preview (7.8M)

    Data Structure


    Dataset creator and donor: Zhi Liu, e-mail: liuzhi8673 '@', institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China

    Data Set Information:

    The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) were selected from among the authors of texts labeled with at least one subtopic of the class CCAT (corporate/industrial). Restricting the texts to a single class is an attempt to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author), and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.

    Attribute Information:

    The attributes of the dataset are character n-grams (n = 1 to 5).
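
    As an illustration of what a character n-gram attribute looks like, the following minimal Python sketch counts all character n-grams of length 1 to 5 in a string. The function name `char_ngrams` and the sample text are hypothetical, and this page does not state how the original features were preprocessed (case folding, whitespace handling, frequency thresholds), so the sketch simply counts every raw substring.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count every character n-gram of length n_min..n_max in text.

    Assumption: no normalization is applied; spaces and punctuation
    are treated as ordinary characters.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


sample = "the cat"  # hypothetical sample text, not taken from the corpus
counts = char_ngrams(sample)
print(counts.most_common(5))
```

    A per-text feature vector could then be formed by looking up the counts of a fixed n-gram vocabulary; how that vocabulary was chosen for this dataset is not described here.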

    Relevant Papers:

    J. Houvardas, E. Stamatatos, "N-gram Feature Selection for Authorship Identification," in Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, vol. 4183, pp. 77-86, (2006) September 12-15; Varna, Bulgaria.
    E. Stamatatos, "Author Identification Using Imbalanced and Limited Training Texts," in Proc. of the 4th International Workshop on Text-based Information Retrieval, (2007) September 3-7; Regensburg, Germany.

    Citation Request:

    Please refer to the donor, Zhi Liu, from the National Engineering Research Center for E-Learning Technology, China.