IMDB 50K Movie Reviews (Test Your BERT)

Size: 62.91M

Tags: Arts and Entertainment, Internet, Movies and TV Shows, NLP, Text Data, Art Classification


README.md

Context

**`Large Movie Review Dataset v1.0`**

![IMDB wall](https://static.amazon.jobs/teams/53/images/IMDb_Header_Page.jpg?1501027252)

This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. It provides a set of 25,000 highly polar movie reviews for training and 25,000 for testing. Additional unlabeled data is available as well. Both raw text and an already processed bag-of-words format are provided.

In the entire collection, no more than 30 reviews are allowed for any given movie, because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance gain can be obtained by memorising movie-unique terms and their association with observed labels.

In the labelled train/test sets, a `negative` review has a **score <= 4 out of 10**, and a `positive` review has a **score >= 7 out of 10**. Thus reviews with more neutral ratings are not included in the train/test sets. In the unsupervised set, reviews of any rating are included, and there are an even number of reviews **> 5 and <= 5**.

**`Reference:`** http://ai.stanford.edu/~amaas/data/sentiment/

***NOTE***

**`A starter kernel is here :`** https://www.kaggle.com/atulanandjha/bert-testing-on-imdb-dataset-starter-kernel

**`A kernel to expose Dataset collection :`**

Content

Now let's understand the task at hand: given a movie review, predict whether it is `positive` or `negative`.

The dataset we use is **50,000 IMDB** reviews (**25K for train and 25K for test**) from the **PyTorch-NLP** library.

Each review is tagged **pos** or **neg**.

There are **50% positive** reviews and **50% negative** reviews in both the train and test sets.

Columns:

`text :` Reviews from people.

`Sentiment :` Negative or Positive tag on the review/feedback (Boolean).

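
The labelling rule described above (score <= 4 is `negative`, score >= 7 is `positive`, anything in between is left out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `label_from_score` and `class_balance`, the 1 to 10 rating range check, and the sample scores are hypothetical and are not part of the dataset or of any library.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a rating out of 10 to a sentiment label using the README thresholds.

    Returns None for neutral ratings (5 or 6), which the labelled
    train/test sets exclude.
    """
    # Assumption: ratings are integers from 1 to 10.
    if not 1 <= score <= 10:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    return None


def class_balance(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical ratings, made up for illustration only.
scores = [1, 4, 5, 6, 7, 10]
labels = [label_from_score(s) for s in scores]
kept = [label for label in labels if label is not None]

print(labels)
print(class_balance(kept))
```

Running `class_balance` over the `Sentiment` column of the real train or test split is a quick way to confirm the 50/50 split stated above before training a model.
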
Acknowledgements

**When using this dataset, please `cite` this ACL paper:**

> @InProceedings{maas-EtAl:2011:ACL-HLT2011,
>   author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
>   title = {Learning Word Vectors for Sentiment Analysis},
>   booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
>   month = {June},
>   year = {2011},
>   address = {Portland, Oregon, USA},
>   publisher = {Association for Computational Linguistics},
>   pages = {142--150},
>   url = {http://www.aclweb.org/anthology/P11-1015}
> }

**Links to the reference dataset:**

https://pytorchnlp.readthedocs.io/en/latest/_modules/torchnlp/datasets/imdb.html

https://www.samyzaf.com/ML/imdb/imdb.html

Inspiration

BERT and other Transformer-based models have attracted a great deal of attention recently, largely because they brought effective transfer learning to NLP. So let's use this simple yet effective dataset to test these models and compare our results with theirs.

I also invite fellow researchers to try their state-of-the-art algorithms on this dataset.

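Data usage statement:

I. Data source and display:

1. This data comes from internet data collection or was supplied by a service provider; this platform lets users view and browse datasets.
2. This platform only displays basic dataset information, including but not limited to file types such as images, text, video, and audio.
3. Basic dataset information comes from the original data address or from the data provider. If the dataset description differs, the original data address or the provider's original address prevails.

II. Ownership:

1. The copyright of every dataset on this site belongs to the original data publisher or data provider.

III. Reposting:

1. If you need to repost data from this site, please retain the original data address and the relevant copyright notices.

IV. Infringement and handling:

1. If any data displayed on this site involves infringement, please contact the site promptly and we will arrange for the data to be taken down.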
