Blog Comment Dataset

2.5M
Social Regression



    README.md

    Data Set Information:

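    This data originates from blog posts. The raw HTML documents of the blog posts were crawled and processed. The prediction task associated with the data is to predict the number of comments in the upcoming 24 hours. In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime, therefore each instance corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime.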
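The selection window and the target definition described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure the dataset authors used: the post records (dicts with hypothetical `published` and `comments` keys), the function names, and the choice of which interval ends are inclusive are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def select_posts(posts, basetime, window_hours=72):
    """Keep posts published at most `window_hours` before `basetime`."""
    earliest = basetime - timedelta(hours=window_hours)
    # Assumption: window start is inclusive, basetime itself is exclusive.
    return [p for p in posts if earliest <= p["published"] < basetime]


def target_comments(comment_times, basetime, horizon_hours=24):
    """Count comments received in the `horizon_hours` after `basetime`."""
    end = basetime + timedelta(hours=horizon_hours)
    return sum(1 for t in comment_times if basetime <= t < end)


# Hypothetical toy data: one recent post and one that is too old.
basetime = datetime(2011, 5, 10, 12, 0)
posts = [
    {
        "published": datetime(2011, 5, 9, 8, 0),
        "comments": [
            datetime(2011, 5, 9, 9, 0),    # before basetime: feature, not target
            datetime(2011, 5, 10, 15, 0),  # within 24 hours after basetime
            datetime(2011, 5, 12, 9, 0),   # later than 24 hours after basetime
        ],
    },
    {"published": datetime(2011, 5, 1, 8, 0), "comments": []},
]

selected = select_posts(posts, basetime)
targets = [target_comments(p["comments"], basetime) for p in selected]
```

Only the first post falls inside the 72-hour window, and only one of its comments falls inside the 24-hour horizon after the basetime.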

    In the train data, the basetimes were in the years 2010 and 2011. In the test data, the basetimes were in February and March 2012. This simulates the real-world situation in which training data from the past is available to predict events in the future.

    The train data was generated from different basetimes that may temporally overlap. If you simply split the train
    data into disjoint partitions, the underlying time intervals may therefore still overlap. You should instead use the
    provided, temporally disjoint train and test splits in order to ensure that the evaluation is fair.
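One way to follow this advice is to load the provided train and test files separately and never re-split the train data at random. The sketch below assumes the splits are headerless, comma-separated files with 281 numeric columns (280 features followed by the target); the file names in the usage comment are placeholders, not names confirmed by this page.

```python
import csv

N_COLUMNS = 281  # assumption: 280 features followed by the target


def load_split(path):
    """Read one provided split and return (features, targets) as lists."""
    features, targets = [], []
    with open(path, newline="") as f:
        for line_no, row in enumerate(csv.reader(f), start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != N_COLUMNS:
                raise ValueError(
                    f"{path}:{line_no}: expected {N_COLUMNS} columns, got {len(row)}"
                )
            values = [float(v) for v in row]
            features.append(values[:-1])
            targets.append(values[-1])
    return features, targets


# Usage (placeholder file names):
# X_train, y_train = load_split("train.csv")
# X_test, y_test = load_split("test.csv")
```

Keeping the two calls separate means model selection happens on the train file only, while the temporally later test file is reserved for the final evaluation.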


    Attribute Information:

    1...50:
         Average, standard deviation, min, max and median of
         Attributes 51...60 for the source of the current blog post.
         By source we mean the blog on which the post appeared.
         For example, myblog.blog.org would be the source of
         the post myblog.blog.org/post_2010_09_10
    51:   Total number of comments before basetime
    52:   Number of comments in the last 24 hours before the
         basetime
    53:   Let T1 denote the datetime 48 hours before basetime,
         Let T2 denote the datetime 24 hours before basetime.
         This attribute is the number of comments in the time period
         between T1 and T2
    54:   Number of comments in the first 24 hours after the
         publication of the blog post, but before basetime
    55:   The difference of Attribute 52 and Attribute 53
    56...60:
         The same features as attributes 51...55, but
         features 56...60 refer to the number of links (trackbacks),
         while features 51...55 refer to the number of comments.
    61:   The length of time between the publication of the blog post
         and basetime
    62:   The length of the blog post
    63...262:
         The 200 bag-of-words features for 200 frequent words of the
         text of the blog post
    263...269: binary indicator features (0 or 1) for the weekday
         (Monday...Sunday) of the basetime
    270...276: binary indicator features (0 or 1) for the weekday
         (Monday...Sunday) of the date of publication of the blog
         post
    277:  Number of parent pages: we consider a blog post P as a
         parent of blog post B, if B is a reply (trackback) to
         blog post P.
    278...280:  
         Minimum, maximum, average number of comments that the
         parents received
    281:  The target: the number of comments in the next 24 hours
         (relative to basetime)
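The attribute list above translates directly into column ranges. The following sketch maps each 1-based attribute range to a 0-based slice of a row; the group names are hypothetical labels chosen for this example, and the consistency check assumes that Attribute 55 is computed as Attribute 52 minus Attribute 53, which the description does not state explicitly.

```python
# 1-based (first, last) attribute ranges, both ends inclusive.
FEATURE_GROUPS = {
    "source_stats": (1, 50),
    "comments": (51, 55),
    "links": (56, 60),
    "post_age": (61, 61),
    "post_length": (62, 62),
    "bag_of_words": (63, 262),
    "basetime_weekday": (263, 269),
    "publication_weekday": (270, 276),
    "parent_count": (277, 277),
    "parent_comment_stats": (278, 280),
    "target": (281, 281),
}


def split_row(row):
    """Split one 281-value row into named groups of values."""
    if len(row) != 281:
        raise ValueError(f"expected 281 values, got {len(row)}")
    # Attribute k (1-based) lives at index k - 1 (0-based).
    return {
        name: row[first - 1:last]
        for name, (first, last) in FEATURE_GROUPS.items()
    }


def comment_delta_consistent(row, tol=1e-9):
    """Check Attribute 55 == Attribute 52 - Attribute 53 (assumed order)."""
    return abs(row[54] - (row[51] - row[52])) <= tol
```

A check like `comment_delta_consistent` is a cheap way to confirm that the columns were read in the expected order before training anything on them.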


    Relevant Papers:

    Buza, K. (2014). Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing.



    Citation Request:

    Buza, K. (2014). Feedback Prediction for Blogs. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing.


    Krisztian Buza
    Budapest University of Technology and Economics
    buza '@' cs.bme.hu
    http://www.cs.bme.hu/~buza
