Twitter User Gender Classification

Size: 7.8M

Tags: Internet, Online Communities, Social Networks, Gender Classification

README.md

This data set was used to train a CrowdFlower AI gender predictor. [You can read all about the project here](https://www.crowdflower.com/using-machine-learning-to-predict-gender/). Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.

## Inspiration

Here are a few questions you might try to answer with this dataset:

- how well do words in tweets and profiles predict user gender?
- what are the words that strongly predict male or female gender?
- how well do stylistic factors (like link color and sidebar color) predict user gender?

## Acknowledgments

Data was provided by the [Data For Everyone Library](https://www.crowdflower.com/data-for-everyone/) on [Crowdflower](https://www.crowdflower.com). Our Data for Everyone library is a collection of our favorite open data jobs that have come through our platform. They're available free of charge for the community, forever.

## The Data

The dataset contains the following fields:

- **_unit_id**: a unique id for user
- **_golden**: whether the user was included in the gold standard for the model; TRUE or FALSE
- **_unit_state**: state of the observation; one of *finalized* (for contributor-judged) or *golden* (for gold standard observations)
- **_trusted_judgments**: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations
- **_last_judgment_at**: date and time of last contributor judgment; blank for gold standard observations
- **gender**: one of *male*, *female*, or *brand* (for non-human profiles)
- **gender:confidence**: a float representing confidence in the provided gender
- **profile_yn**: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it
- **profile_yn:confidence**: confidence in the existence/non-existence of the profile
- **created**: date and time when the profile was created
- **description**: the user's profile description
- **fav_number**: number of tweets the user has favorited
- **gender_gold**: if the profile is golden, what is the gender?
- **link_color**: the link color on the profile, as a hex value
- **name**: the user's name
- **profile_yn_gold**: whether the profile y/n value is golden
- **profileimage**: a link to the profile image
- **retweet_count**: number of times the user has retweeted (or possibly, been retweeted)
- **sidebar_color**: color of the profile sidebar, as a hex value
- **text**: text of a random one of the user's tweets
- **tweet_coord**: if the user has location turned on, the coordinates as a string with the format "[*latitude*, *longitude*]"
- **tweet_count**: number of tweets that the user has posted
- **tweet_created**: when the random tweet (in the **text** column) was created
- **tweet_id**: the tweet id of the random tweet
- **tweet_location**: location of the tweet; seems to not be particularly normalized
- **user_timezone**: the timezone of the user
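
As a rough illustration of how a few of the fields above might be read, here is a minimal Python sketch using only the standard library. The three rows in `SAMPLE_CSV` are made up for illustration and are not taken from the real file, and the helper names (`hex_to_rgb`, `parse_coord`, `load_rows`) are hypothetical. It assumes that hex colors are six digits without a leading `#` and that `tweet_coord` is a bracketed pair that parses as a JSON list; neither assumption is confirmed by the README, so check both against the actual data.

```python
import csv
import io
import json

# Made-up sample rows for illustration only; NOT taken from the real dataset.
SAMPLE_CSV = '''_unit_id,_golden,gender,gender:confidence,link_color,sidebar_color,tweet_coord
1,FALSE,male,1.0,08C2C2,FFFFFF,
2,FALSE,female,0.6625,0084B4,C0DEED,"[40.7128, -74.006]"
3,TRUE,brand,1.0,ABB8C2,EEEEEE,
'''


def hex_to_rgb(value):
    """Convert a 6-digit hex color such as '08C2C2' to (r, g, b), or None if malformed."""
    value = value.strip().lstrip("#")
    if len(value) != 6:
        return None
    try:
        return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))
    except ValueError:
        return None


def parse_coord(value):
    """Parse '[latitude, longitude]' into a (lat, lon) tuple of floats, or None if blank."""
    value = value.strip()
    if not value:
        return None
    lat, lon = json.loads(value)  # assumes the string is a valid JSON list
    return float(lat), float(lon)


def load_rows(handle, min_confidence=0.0):
    """Read rows from a CSV file handle, keeping those at or above min_confidence."""
    rows = []
    for row in csv.DictReader(handle):
        confidence = float(row["gender:confidence"] or 0.0)
        if confidence < min_confidence:
            continue
        rows.append({
            "unit_id": row["_unit_id"],
            "golden": row["_golden"].upper() == "TRUE",
            "gender": row["gender"],
            "confidence": confidence,
            "link_rgb": hex_to_rgb(row["link_color"]),
            "coord": parse_coord(row["tweet_coord"]),
        })
    return rows


# Keep only rows whose gender label has confidence of at least 0.9.
rows = load_rows(io.StringIO(SAMPLE_CSV), min_confidence=0.9)
```

To try this on the real file, pass an open file handle to `load_rows` in place of the `io.StringIO` wrapper. The file name and encoding are not stated in the README, so both would need to be checked first.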

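Data usage statement:

I. Data source and display:

1. This data was collected from the internet or supplied by a service provider; this platform lets users view and browse the dataset.
2. This platform only displays basic information about datasets, which include but are not limited to image, text, video, and audio file types.
3. The basic information about a dataset comes from the original data source or from the data provider. If the dataset description differs, the original source address or the provider's original address prevails.

II. Ownership:

1. Copyright in all datasets on this site belongs to the original data publisher or data provider.

III. Reposting:

1. If you need to repost data from this site, retain the original data address and the related copyright notice.

IV. Infringement and handling:

1. If any data displayed on this site infringes rights, contact the site promptly and we will arrange to take the data down.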
