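The StreetScenes Challenge Framework is a collection of images, annotations, software, and performance measures for object detection. Each image was taken with a DSC-F717 camera in and around Boston, Massachusetts. Each image was then labeled by hand, with polygons drawn around every instance of 9 object categories: [cars, pedestrians, bicycles, buildings, trees, skies, roads, sidewalks, and stores]. The labeling was carefully reviewed to ensure that objects were always labeled in the same way with respect to occlusion and other common image transformations. StreetScenes labels are also compatible with LabelMe annotations, and a one-to-one conversion tool is available here. For more information on data collection, see Stanley Bileschi's thesis.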
Object Detection Models
Crop-Wise Object Detection
Crop-wise object detection is a simple and common way of measuring the power of an object detection system. In this method, small crops of positive and negative examples of the target object category are first extracted from the larger images. For instance, positive car images would contain tightly cropped images of cars, while negative car images would contain anything but cars. Each crop is then converted to a feature representation, e.g. wavelets or histograms of gradients, and a statistical learning machine is trained to classify between the two sets. To measure the efficacy of the learned detector, part of the labeled set is held out from training and used only for testing (I prefer to hold out about one third). Repeating this training/testing split several times with different random partitions gives a more reliable estimate of crop-wise detection performance, along with a sense of its variance.
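The repeated train/test split described above can be sketched as follows. This is only an illustration under stated assumptions: the nearest-centroid learner is a stand-in chosen to keep the example short (the text does not name a particular classifier), and the function names, the `repeats` count, and the feature vectors are all hypothetical.

```python
import random
import statistics

def nearest_centroid_fit(X, y):
    """Stand-in learner (an assumption): store one mean vector per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def nearest_centroid_predict(model, x):
    """Return the label whose centroid is closest to x."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def repeated_holdout(X, y, test_fraction=1 / 3, repeats=10, seed=0):
    """Hold out test_fraction of the crops, train on the rest, repeat.

    Returns the mean and standard deviation of test accuracy.
    Requires repeats >= 2 so that the standard deviation is defined.
    """
    rng = random.Random(seed)
    n = len(X)
    n_test = max(1, round(n * test_fraction))
    accuracies = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        test_idx, train_idx = order[:n_test], order[n_test:]
        model = nearest_centroid_fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        correct = sum(nearest_centroid_predict(model, X[i]) == y[i]
                      for i in test_idx)
        accuracies.append(correct / n_test)
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

Here `X` would hold one feature vector per crop and `y` the matching positive/negative labels. Reporting the spread across splits alongside the mean shows how much the result depends on the particular partition.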
Point-Wise Object Detection
Point-wise object detection is similar to crop-wise object detection, except that rather than classifying boxes that fit around the object of interest, we classify points (and their neighborhoods) inside the object. In this method, a set of positive points and a set of negative points are selected (i.e. points inside and outside of the object). At each of these points a feature vector is extracted, which in general depends on patterns of brightness and color in the neighborhood of the point. Once these features have been extracted, learning and testing proceed as in crop-wise object detection.
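As a rough sketch of the point selection and feature extraction steps, the code below samples pixels inside and outside a labeled polygon and computes a toy neighborhood feature. Everything here is an assumption made for illustration: the ray-casting test, the uniform random sampling, the `max_tries` bound, and the mean/range brightness feature are not taken from the framework itself, and the image is assumed to be a plain list of rows of grayscale values.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_points(width, height, polygon, n_per_class, seed=0, max_tries=10000):
    """Draw random pixels until each class has n_per_class points (or tries run out)."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for _ in range(max_tries):
        if len(positives) >= n_per_class and len(negatives) >= n_per_class:
            break
        x, y = rng.randrange(width), rng.randrange(height)
        target = positives if point_in_polygon(x, y, polygon) else negatives
        if len(target) < n_per_class:
            target.append((x, y))
    return positives, negatives

def neighborhood_feature(image, x, y, radius=2):
    """Toy feature (an assumption): mean and range of brightness in a square window."""
    h, w = len(image), len(image[0])
    values = [image[r][c]
              for r in range(max(0, y - radius), min(h, y + radius + 1))
              for c in range(max(0, x - radius), min(w, x + radius + 1))]
    return [sum(values) / len(values), max(values) - min(values)]
```

The resulting feature vectors and their positive/negative labels can then be passed to the same kind of train/test procedure used for crops.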
Bounding Box-Wise Object Detection
Bounding box-wise object detection is the measure closest to actually running a useful object detection system on these types of scenes. In this method, an object detector is trained as in crop-wise detection, but is then applied to a reserved set of test images at multiple positions and scales. The response of the detector is fed to a local-neighborhood suppression algorithm, which outputs a set of positions and confidences within the test set for possible object existence. This set is then compared to the human-labeled benchmark positions, and detections that are close enough in position and scale are counted as true detections. Using this data, a precision-recall curve is drawn to measure the total system performance.
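The suppression, matching, and precision-recall steps might look roughly like the sketch below. It makes several assumptions that the text does not specify: boxes are `(x1, y1, x2, y2)` tuples, suppression is a greedy keep-the-strongest scheme, and "close enough in position and scale" is approximated by an intersection-over-union threshold of 0.5. The framework's own criteria may differ.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def suppress(detections, overlap=0.5):
    """Greedy local suppression: keep the strongest box, drop weaker overlapping ones.

    detections is a list of (box, confidence) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < overlap for kept_box, _ in kept):
            kept.append((box, score))
    return kept

def precision_recall(detections, ground_truth, match_iou=0.5):
    """Walk detections from most to least confident.

    Each ground-truth box can be matched at most once. Returns one
    (precision, recall) pair per detection, i.e. the points of the curve.
    """
    unmatched = list(ground_truth)
    n_gt = len(ground_truth)
    tp = fp = 0
    curve = []
    for box, _ in sorted(detections, key=lambda d: d[1], reverse=True):
        best = max(unmatched, key=lambda g: iou(box, g), default=None)
        if best is not None and iou(box, best) >= match_iou:
            unmatched.remove(best)
            tp += 1
        else:
            fp += 1
        curve.append((tp / (tp + fp), tp / n_gt if n_gt else 0.0))
    return curve
```

In use, the raw multi-scale detector responses would be passed through `suppress` first, and the surviving boxes compared against the hand-labeled boxes with `precision_recall`.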