Bag of Words Data Set

1.77G
NLP Classification


    README.md

    Data Set Information:

    For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words).  After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times.  Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons.
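
The truncation step described above can be sketched as follows. This is an illustration only, not the preprocessing code used to build these files: the function name `truncate_vocabulary`, the toy documents, and the choice to count occurrences across the whole collection (rather than per document) are all assumptions.

```python
from collections import Counter


def truncate_vocabulary(tokenized_docs, stopwords, min_count=10):
    """Return the words that occur more than `min_count` times in total.

    Assumption: "occurred more than ten times" is read as the total count
    over the whole collection, after stopwords have been removed.
    """
    totals = Counter(
        token
        for doc in tokenized_docs
        for token in doc
        if token not in stopwords
    )
    # Sorting is only for a stable, readable result; the real files may
    # order the vocabulary differently.
    return sorted(word for word, n in totals.items() if n > min_count)


# Toy collection (hypothetical): "topic" occurs 12 times, "model" 11, "rare" once.
docs = [["the", "topic", "model"]] * 11 + [["the", "rare", "topic"]]
vocab = truncate_vocabulary(docs, stopwords={"the"})
print(vocab)
```

In this toy run "rare" is dropped because it appears only once, and "the" is dropped as a stopword before counting.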

    These data sets have no class labels and, for copyright reasons, no filenames or other document-level metadata.  They are well suited to clustering and topic modeling experiments.

    For each text collection we provide docword.*.txt (the bag of words file in sparse format) and vocab.*.txt (the vocab file).

    Enron Emails:
    orig source: www.cs.cmu.edu/~enron
    D=39861
    W=28102
    N=6,400,000 (approx)

    NIPS full papers:
    orig source: books.nips.cc
    D=1500
    W=12419
    N=1,900,000 (approx)

    KOS blog entries:
    orig source: dailykos.com
    D=3430
    W=6906
    N=467714

    NYTimes news articles:
    orig source: ldc.upenn.edu
    D=300000
    W=102660
    N=100,000,000 (approx)

    PubMed abstracts:
    orig source: www.pubmed.gov
    D=8200000
    W=141043
    N=730,000,000 (approx)


    Attribute Information:

    The format of the docword.*.txt file is 3 header lines, followed by
    NNZ triples:
    ---
    D
    W
    NNZ
    docID wordID count
    docID wordID count
    docID wordID count
    docID wordID count
    ...
    docID wordID count
    docID wordID count
    docID wordID count
    ---
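
A minimal reader for the layout above might look like the sketch below. The function name `read_docword` and the in-memory sample are hypothetical; the only things taken from the description are the three header lines (D, W, NNZ) and the whitespace-separated `docID wordID count` triples.

```python
import io
from collections import defaultdict


def read_docword(stream):
    """Parse a docword stream: header lines D, W, NNZ, then NNZ triples."""
    D = int(stream.readline())
    W = int(stream.readline())
    NNZ = int(stream.readline())

    counts = defaultdict(dict)  # docID -> {wordID: count}
    seen = 0
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        doc_id, word_id, count = map(int, line.split())
        counts[doc_id][word_id] = count
        seen += 1

    if seen != NNZ:
        raise ValueError(f"expected {NNZ} triples, found {seen}")
    return D, W, NNZ, dict(counts)


# Tiny made-up sample: 2 documents, 3 vocabulary words, 4 nonzero counts.
sample = io.StringIO("2\n3\n4\n1 1 2\n1 3 1\n2 2 5\n2 3 1\n")
D, W, NNZ, counts = read_docword(sample)

# N, the total number of words, is the sum of all counts.
total_words = sum(c for doc in counts.values() for c in doc.values())
print(D, W, NNZ, total_words)
```

For a real file, the same function could be given an open text handle, for example `open(path)` or, if the file is gzip-compressed, `gzip.open(path, "rt")`. For the larger collections a nested dictionary will be memory-hungry, so a sparse matrix type is likely a better container.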

    The format of the vocab.*.txt file is one word per line: line n contains the word with wordID=n.
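
The sketch below maps wordIDs back to words. It assumes the vocab file holds one word per line and that line n corresponds to wordID n, counting from 1; the function name `read_vocab` and the sample words are hypothetical.

```python
import io


def read_vocab(stream):
    """Return {wordID: word}, assuming line n holds the word for wordID n (1-based)."""
    return {
        i: line.strip()
        for i, line in enumerate(stream, start=1)
        if line.strip()
    }


# Made-up three-word vocabulary.
sample_vocab = io.StringIO("apple\nbanana\ncherry\n")
id_to_word = read_vocab(sample_vocab)

# Translate one document's {wordID: count} entries into {word: count}.
doc_counts = {1: 2, 3: 1}
words = {id_to_word[w]: c for w, c in doc_counts.items()}
print(words)
```

If the assumption about 1-based numbering holds, the number of entries returned for a real vocab file should equal W from the matching docword header.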


    Relevant Papers:

    N/A


    Citation Request:

    Please refer to the Machine Learning Repository's citation policy


    David Newman
    newman '@' uci.edu
    University of California, Irvine
