Dataset creator and donator: Zhi Liu, e-mail: liuzhi8673 '@' gmail.com, institution: National Engineering Research Center for E-Learning, Hubei Wuhan, China
Data Set Information:
The dataset is a subset of RCV1. This corpus has already been used in author identification experiments. The top 50 authors (with respect to total size of articles) of texts labeled with at least one subtopic of the class CCAT (corporate/industrial) were selected. That way, it is attempted to minimize the topic factor in distinguishing among the texts. The training corpus consists of 2,500 texts (50 per author) and the test corpus includes another 2,500 texts (50 per author) that do not overlap with the training texts.
Attributes of the dataset are character n-grams (n = 1 to 5).
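As an illustration of what character n-gram attributes look like, the following is a minimal Python sketch that counts every character n-gram for n = 1 to 5 in a single text. The function name char_ngrams and the sample string are hypothetical. The source does not state whether the original features were raw counts, whether the text was case-folded, or whether any feature selection was applied, so this sketch assumes plain counts over the unmodified text.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=5):
    """Count all character n-grams of length n_min..n_max in text.

    Hypothetical helper: assumes raw counts on the unmodified string,
    which may differ from how the dataset attributes were built.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


sample = "the cat"  # illustrative text, not taken from the corpus
features = char_ngrams(sample)
print(features.most_common(5))
```

Each distinct n-gram becomes one attribute, and its count in a given text is the attribute value for that text.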
J. Houvardas, E. Stamatatos, "N-gram Feature Selection for Authorship Identification," in Proc. of the 12th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications, vol. 4183, pp. 77-86, (2006) September 12-15; Varna, Bulgaria.
E. Stamatatos, "Author Identification Using Imbalanced and Limited Training Texts," in Proc. of the 4th International Workshop on Text-based Information Retrieval, (2007) September 3-7; Regensburg, Germany.
Please refer to the donator Zhi Liu from National Engineering Research Center for E-Learning Technology, China.