
TMDB 5000 Movie Dataset


43.62M
Arts and Entertainment, Movies and TV Shows Classification


    README.md

## Background

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

## Data Source Transfer Summary

We (Kaggle) have removed the original version of this dataset per a [DMCA](https://en.wikipedia.org/wiki/Digital_Millennium_Copyright_Act) takedown request from IMDB. In order to minimize the impact, we're replacing it with a similar set of films and data fields from [The Movie Database (TMDb)](https://www.themoviedb.org) in accordance with [their terms of use](https://www.themoviedb.org/documentation/api/terms-of-use). The bad news is that kernels built on the old dataset will most likely no longer work.

The good news is that:

- You can port your existing kernels over with a bit of editing. [This kernel](https://www.kaggle.com/sohier/getting-imdb-kernels-working-with-tmdb-data/) offers functions and examples for doing so. You can also find [a general introduction to the new format here](https://www.kaggle.com/sohier/tmdb-format-introduction).
- The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
- Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot checked it didn't line up with either the credits order or IMDB's stars order.
- The revenues appear to be more current. For example, IMDB's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
- Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, [this IMDB entry](http://www.imdb.com/title/tt5289954/?ref_=fn_t...) has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.

## Data Source Transfer Details

- Several of the new columns contain JSON. You can save a bit of time by porting the load data functions [from this kernel]().
- Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDB shows the running time of the original version.
- There's now a separate file containing the full credits for both the cast and crew.
- All fields are filled out by users, so don't expect them to agree on keywords, genres, ratings, or the like.
- Your existing kernels will continue to render normally until they are re-run.
- If you are curious about how this dataset was prepared, the code to access TMDb's API is posted [here](https://gist.github.com/SohierDane/4a84cb96d220fc4791f52562be37968b).

New columns:

- homepage
- id
- original_title
- overview
- popularity
- production_companies
- production_countries
- release_date
- spoken_languages
- status
- tagline
- vote_average

Lost columns:

- actor_1_facebook_likes
- actor_2_facebook_likes
- actor_3_facebook_likes
- aspect_ratio
- cast_total_facebook_likes
- color
- content_rating
- director_facebook_likes
- facenumber_in_poster
- movie_facebook_likes
- movie_imdb_link
- num_critic_for_reviews
- num_user_for_reviews

## Open Questions About the Data

There are some things we haven't had a chance to confirm about the new dataset. If you have any insights, please let us know in the forums!

- Are the budgets and revenues all in US dollars? Do they consistently show the global revenues?
- This dataset hasn't yet gone through a data quality analysis. Can you find any obvious corrections? For example, in the IMDb version it was necessary to treat values of zero in the budget field as missing. Similar findings would be very helpful to your fellow Kagglers! (It's probably a good idea to keep treating zeros as missing, with the caveat that missing budgets are much more likely to have come from small-budget films in the first place.)

## Inspiration

- Can you categorize the films by type, such as animated or not? We don't have explicit labels for this, but it should be possible to build them from the crew's job titles.
- How sharp is the divide between major film studios and the independents? Do those two groups fall naturally out of a clustering analysis, or is something more complicated going on?

## Acknowledgements

This dataset was generated from [The Movie Database](https://www.themoviedb.org) API. This product uses the TMDb API but is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows. You can [try it for yourself here](https://www.themoviedb.org/documentation/api).

![](https://www.themoviedb.org/assets/static_cache/9b3f9c24d9fd5f297ae433eb33d93514/images/v4/logos/408x161-powered-by-rectangle-green.png)
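Two of the notes above (several columns contain JSON, and zeros in the budget field are best treated as missing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the loading code from the referenced kernels: the inline sample rows are made up, the helper names `load_movies` and `names` are hypothetical, and the JSON shape (a list of objects with a `name` key) is assumed rather than documented here. Only the standard library is used.

```python
import csv
import io
import json

# Made-up stand-in rows for the movies file. The column names follow the
# README's wording (budget, genres, id); the JSON shape is an assumption.
SAMPLE_MOVIES_CSV = '''id,title,budget,genres
1,Example A,1000000,"[{""id"": 16, ""name"": ""Animation""}]"
2,Example B,0,"[{""id"": 18, ""name"": ""Drama""}]"
'''

# Columns assumed to hold JSON text; extend with keywords,
# production_companies, and so on as needed.
JSON_COLUMNS = ("genres",)


def load_movies(handle):
    """Read movie rows, decoding JSON columns and mapping budget 0 to None."""
    movies = []
    for row in csv.DictReader(handle):
        for column in JSON_COLUMNS:
            row[column] = json.loads(row[column])
        budget = int(row["budget"])
        # A zero budget is treated as "unknown", as the README suggests.
        row["budget"] = budget if budget > 0 else None
        movies.append(row)
    return movies


def names(entries):
    """Collect the 'name' field from a decoded JSON list of objects."""
    return [entry["name"] for entry in entries]


movies = load_movies(io.StringIO(SAMPLE_MOVIES_CSV))
for movie in movies:
    print(movie["title"], movie["budget"], names(movie["genres"]))
```

To run this against the real data, pass an opened file handle instead of the `io.StringIO` sample. Keeping missing budgets as `None` rather than `0` keeps them out of averages and ratios, although, as noted above, the films with missing budgets are probably not a random subset.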