To the best of our knowledge this is the largest publicly available dataset of face images with gender and age labels for training. We provide pretrained models for both age and gender prediction.
Since the publicly available face image datasets are often of small to
medium size, rarely exceeding tens of thousands of images, and often
without age information, we decided to collect a large dataset of celebrities.
For this purpose, we took the list of the 100,000 most popular actors as
listed on the IMDb website and (automatically) crawled from their
profiles the date of birth, name, gender and all images related to that person.
Additionally, we crawled all profile images from pages of people on
Wikipedia with the same meta information.
We removed the images without a timestamp (the date when the photo was taken).
Assuming that the images with single faces are likely to show the actor
and that the timestamp and date of birth are correct, we were able to
assign to each such image the biological (real) age. Of course, we
cannot vouch for the accuracy of the assigned age information. Besides
wrong timestamps, many images are stills from movies, which can have
extended production times. In total we obtained 460,723 face images
from 20,284 celebrities from IMDb and 62,328 from Wikipedia, thus
523,051 in total.
As some of the images (especially from IMDb) contain several people, we only use the photos where the second strongest face detection is below a threshold. For the network to be equally discriminative for all ages, we equalize the age distribution for training. For more details, please see the paper.
For both the IMDb and Wikipedia images we provide a separate .mat file containing all the meta information, which can be loaded with Matlab. The format is as follows:
dob: date of birth (Matlab serial date number)
photo_taken: year when the photo was taken
full_path: path to file
gender: 0 for female and 1 for male, NaN if unknown
name: name of the celebrity
face_location: location of the face. To crop the face in Matlab run
face_score: detector score (the higher the better). Inf implies that no face was found in the image and the face_location then just returns the entire image
second_face_score: detector score of the face with the second highest score. This is useful to ignore images with more than one face. second_face_score is NaN if no second face was detected.
celeb_names (IMDb only): list of all celebrity names
celeb_id (IMDb only): index of celebrity name
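The face_score and second_face_score fields can be combined to keep only images that likely show a single face. The following Python function is a minimal sketch of such a filter, not the authors' code. The function name and the second_face_threshold parameter are hypothetical, and the threshold value used for the published models is not stated here.

```python
import math


def has_single_face(face_score, second_face_score, second_face_threshold):
    """Return True if an image likely shows exactly one face.

    second_face_threshold is a hypothetical parameter; choose it for your data.
    """
    # An infinite face_score means no face was found in the image.
    if math.isinf(face_score) or math.isnan(face_score):
        return False
    # NaN means no second face was detected, so the image is kept.
    if math.isnan(second_face_score):
        return True
    # Otherwise keep the image only if the second detection is weak enough.
    return second_face_score < second_face_threshold
```

For example, has_single_face(3.2, float("nan"), 1.0) keeps the image, while has_single_face(3.2, 2.5, 1.0) rejects it because the second detection exceeds the threshold.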
The age of a person can be calculated based on the date of birth and the time when the photo was taken (note that we assume that the photo was taken in the middle of the year):
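For users working outside Matlab, the same calculation can be sketched in Python. This is an illustrative sketch under stated assumptions, not the authors' code. It assumes dob is a valid Matlab serial date number (not NaN), uses the fixed offset of 366 days between Matlab serial dates and Python's proleptic Gregorian ordinals, and takes July 1 as the middle of the year. The function names are hypothetical.

```python
from datetime import date

# Matlab serial date 367 is 0001-01-01, which Python counts as ordinal 1.
MATLAB_TO_PYTHON_ORDINAL_OFFSET = 366


def matlab_datenum_to_date(datenum):
    """Convert a Matlab serial date number to a Python date (time of day dropped)."""
    return date.fromordinal(int(datenum) - MATLAB_TO_PYTHON_ORDINAL_OFFSET)


def age_at_photo(dob_datenum, photo_taken):
    """Age in whole years, assuming the photo was taken on July 1 of photo_taken."""
    birth = matlab_datenum_to_date(dob_datenum)
    photo = date(photo_taken, 7, 1)
    age = photo.year - birth.year
    # Subtract one year if the birthday has not yet occurred by July 1.
    if (photo.month, photo.day) < (birth.month, birth.day):
        age -= 1
    return age
```

Because photo_taken only records the year, the result can be off by one year in either direction. Entries with a missing or implausible dob should be filtered out before calling these functions.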