Sentiment polarity datasets
Pool of 27886 unprocessed html files (81.1Mb) from which the polarity dataset v2.0 was derived. (This file is identical to movie.zip from data release v1.0.)
polarity dataset v1.1 (2.2Mb) (includes README.1.1): approximately 700 positive and 700 negative processed reviews. Released November 2002. This alternative version was created by Nathan Treloar, who removed a few non-English/incomplete reviews and changing some of the labels (judging some polarities to be different from the original author's rating). The complete list of changes made to v1.1 can be found in diff.txt.
polarity dataset v0.9 (2.8Mb) (includes a README):. 700 positive and 700 negative processed reviews. Introduced in Pang/Lee/Vaithyanathan EMNLP 2002. Released July 2002. Please read the "Rating Information - WARNING" section of the README.
movie.zip (81.1Mb): all html files we collected from the IMDb archive.
Sentiment scale datasets
Sep 30, 2009: Yanir Seroussi points out that due to some misformatting in the raw html files, six reviews are misattributed to Dennis Schwartz (29411 should be Max Messier, 29412 should be Norm Schrager, 29418 should be Steve Rhodes, 29419 should be Blake French, 29420 should be Pete Croatto, 29422 should be Rachel Gordon) and one (23982) is blank.
Pool of unprocessed source documents (9.3Mb) from which the sentences in the subjectivity dataset v1.0 were extracted. Note: On April 2, 2012, we replaced the original gzipped tarball with one in which the subjective files are now in the correct directory (so that the subjectivity directory is no longer empty; the subjective files were mistakenly placed in the wrong directory, although distinguishable by their different naming scheme).