QMNIST

The exact preprocessing steps used to construct the MNIST dataset have long been lost. This leaves us with no reliable way to associate its characters with the ID of the writer and little hope of recovering the full MNIST testing set, which had 60K images but was never released. The official MNIST testing set contains only 10K randomly sampled images and is often considered too small to provide meaningful confidence intervals.

The QMNIST dataset was generated from the original data found in the NIST Special Database 19 with the goal of matching the MNIST preprocessing as closely as possible.

Using QMNIST

We describe below how to use QMNIST in order of increasing complexity.

Update - The PyTorch QMNIST loader described in the loader section below is now included in torchvision.

Using the QMNIST extended testing set

The simplest way to use the QMNIST extended testing set is to download the following two files. These gzipped files have the same format as the standard MNIST data files but contain all 60000 testing examples. The first 10000 examples are the QMNIST reconstruction of the standard MNIST testing digits. The remaining 50000 examples are the reconstruction of the lost MNIST testing digits.

Filename	Format	Description
`qmnist-test-images-idx3-ubyte.gz`	60000x28x28	testing images
`qmnist-test-labels-idx1-ubyte.gz`	60000	testing labels
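Because these files follow the standard MNIST layout, they can be read with a few lines of standard-library Python. The following is a minimal sketch, not part of the QMNIST distribution: the function names `parse_idx_ubyte` and `read_idx_ubyte` are hypothetical, and the header layout (two zero bytes, a type code of 0x08 for unsigned bytes, a dimension count, then one big-endian 32-bit size per dimension) is the usual IDX convention assumed here.

```python
import gzip
import struct


def parse_idx_ubyte(raw):
    """Parse the bytes of a decompressed idx1-ubyte or idx3-ubyte file.

    Returns (dims, data): the dimension sizes and the flat payload bytes.
    """
    # Assumed IDX header: 2 zero bytes, 1 type-code byte, 1 dimension-count byte.
    zero, type_code, ndim = struct.unpack_from(">HBB", raw, 0)
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # One big-endian unsigned 32-bit size per dimension.
    dims = struct.unpack_from(">%dI" % ndim, raw, 4)
    offset = 4 + 4 * ndim
    count = 1
    for d in dims:
        count *= d
    data = raw[offset:offset + count]
    if len(data) != count:
        raise ValueError("file is shorter than its header claims")
    return dims, data


def read_idx_ubyte(path):
    """Decompress a .gz IDX file and parse it."""
    with gzip.open(path, "rb") as f:
        return parse_idx_ubyte(f.read())
```

With the labels file loaded this way, the first 10000 bytes of the payload would be the labels of the reconstructed MNIST testing digits and the remaining 50000 the labels of the lost digits.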

Using the QMNIST extended labels

The official NIST training data (series hsf0 to hsf3, writers 0 to 2099) was written by NIST employees. The official testing data (series hsf4, writers 2100 to 2599) was written by high-school students and is considered to be substantially more challenging. Since machine learning works better when training and testing data follow the same distribution, the creators of the MNIST dataset decided to distribute writers from both series into their training and testing sets. The QMNIST extended labels trace each training or testing digit to its source in the NIST Special Database 19. Since the QMNIST training set and the first 10000 examples of the QMNIST testing set exactly match the MNIST training and testing digits, this information can also be used for the standard MNIST dataset. The extended labels are found in the following files.

Filename	Format	Description
`qmnist-train-labels-idx2-int.gz`	60000x8	extended training labels
`qmnist-train-labels.tsv.gz`	60000x8	same, tab separated file
`qmnist-test-labels-idx2-int.gz`	60000x8	extended testing labels
`qmnist-test-labels.tsv.gz`	60000x8	same, tab separated file

The format of these gzipped files is very similar to the format of the standard MNIST label files. However, instead of being a one-dimensional tensor of unsigned bytes (idx1-ubyte), the label tensor is a two-dimensional tensor of integers (idx2-int) with 8 columns:

Column	Description	Range
0	Character class	0 to 9
1	NIST HSF series	0, 1, or 4
2	NIST writer ID	0-610 and 2100-2599
3	Digit index for this writer	0 to 149
4	NIST class code	30-39
5	Global NIST digit index	0 to 281769
6	Duplicate	0
7	Unused	0
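To make the column layout concrete, here is a small sketch that names the eight columns and counts digits per writer, optionally restricted to one HSF series. The field names and the function `digits_per_writer` are illustrative choices, not identifiers from the QMNIST code; each row is assumed to be a sequence of 8 integers in the column order listed above.

```python
from collections import Counter, namedtuple

# Field names are hypothetical; the order follows the column table.
ExtendedLabel = namedtuple("ExtendedLabel", [
    "digit",               # column 0: character class
    "hsf_series",          # column 1: NIST HSF series
    "writer_id",           # column 2: NIST writer ID
    "writer_digit_index",  # column 3: digit index for this writer
    "nist_class_code",     # column 4: NIST class code
    "global_index",        # column 5: global NIST digit index
    "duplicate",           # column 6: duplicate marker
    "unused",              # column 7: unused
])


def digits_per_writer(rows, series=None):
    """Count rows per writer ID, keeping only one HSF series if given."""
    labels = (ExtendedLabel(*row) for row in rows)
    return Counter(
        label.writer_id
        for label in labels
        if series is None or label.hsf_series == series
    )
```

Calling `digits_per_writer(rows, series=4)` would, under these assumptions, count only the digits written by the hsf4 writers.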

The binary idx2-int files encode this information as a sequence of big-endian 32-bit integers:

Offset	Type	Value	Description
0	32 bit integer	0x0c02 (3074)	magic number
4	32 bit integer	60000	number of rows
8	32 bit integer	8	number of columns
12..	32 bit integers	...	data in row major order
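The layout above maps directly onto Python's `struct` module. This is a minimal sketch under the assumption that the integers are signed 32-bit values; `parse_idx2_int` and `read_extended_labels` are hypothetical helper names, not functions shipped with QMNIST.

```python
import gzip
import struct


def parse_idx2_int(raw):
    """Parse decompressed idx2-int bytes into a list of row tuples."""
    # Header: magic number, number of rows, number of columns (big-endian).
    magic, rows, cols = struct.unpack_from(">iii", raw, 0)
    if magic != 0x0C02:
        raise ValueError("unexpected magic number: %#x" % magic)
    # Data starts at offset 12, in row-major order.
    values = struct.unpack_from(">%di" % (rows * cols), raw, 12)
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def read_extended_labels(path):
    """Decompress a .gz idx2-int file and parse it."""
    with gzip.open(path, "rb") as f:
        return parse_idx2_int(f.read())
```

For the qmnist-* label files, each returned row would hold the 8 columns described earlier.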

Due to popular demand, we also provide the same information as TSV files.

The QMNIST data files

The complete QMNIST distribution provides the following files:

Filename	Format	Description
`qmnist-train-images-idx3-ubyte.gz`	60000x28x28	training images
`qmnist-train-labels-idx2-int.gz`	60000x8	extended training labels
`qmnist-train-labels.tsv.gz`	60000x8	same, tab separated file
`qmnist-test-images-idx3-ubyte.gz`	60000x28x28	testing images
`qmnist-test-labels-idx2-int.gz`	60000x8	extended testing labels
`qmnist-test-labels.tsv.gz`	60000x8	same, tab separated file
`xnist-images-idx3-ubyte.xz`	402953x28x28	NIST digits images
`xnist-labels-idx2-int.xz`	402953x8	NIST digits extended labels
`xnist-labels.tsv.xz`	402953x8	same, tab separated file

Files with the .gz suffix are gzipped and can be decompressed with the standard command gunzip. Files with the .xz suffix are LZMA compressed and can be decompressed using the standard command unxz.

The QMNIST training examples match the MNIST training examples one-by-one and in the same order. The first 10000 QMNIST testing examples match the MNIST testing examples one-by-one and in the same order. The xnist-* data files provide preprocessed images and extended labels for all digits appearing in the NIST Special Database 19 in partition and writer order. Column 5 of the extended labels gives the index of each digit in this file. We found three duplicate digits in the NIST dataset. For these, column 6 of the extended labels contains the index of the digit of which this digit is a duplicate. Since duplicate digits have been eliminated from the QMNIST/MNIST training and testing sets, this never happens in the qmnist-* extended label files.
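The same decompression can be done from Python without calling external commands. The sketch below picks the standard-library codec from the file suffix; `decompress` is a hypothetical helper name, and it works on bytes already read into memory.

```python
import gzip
import lzma


def decompress(filename, payload):
    """Return the decompressed bytes, choosing the codec from the suffix."""
    if filename.endswith(".gz"):
        return gzip.decompress(payload)
    if filename.endswith(".xz"):
        return lzma.decompress(payload)
    # Unknown suffix: assume the payload is not compressed.
    return payload
```

For large files, `gzip.open` and `lzma.open` offer the same codecs as file-like objects, which avoids holding the compressed and decompressed copies in memory at once.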
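Since column 5 is an index into the xnist-* files, it can be used to fetch the matching NIST image for any extended label. The following sketch assumes the decompressed images file uses the standard idx3-ubyte layout (a 16-byte header holding the magic number 0x00000803, the image count, the row count, and the column count, followed by the pixels in order); `image_at` is a hypothetical name.

```python
import struct


def image_at(raw_images, index):
    """Return the pixel bytes of image number `index` from idx3-ubyte data."""
    # Assumed header: magic, image count, rows, columns (big-endian uint32).
    magic, count, rows, cols = struct.unpack_from(">IIII", raw_images, 0)
    if magic != 0x00000803:
        raise ValueError("not an idx3-ubyte file")
    if not 0 <= index < count:
        raise IndexError("image index out of range")
    size = rows * cols
    start = 16 + index * size
    return raw_images[start:start + size]
```

Given an extended label row, `image_at(raw_xnist_images, row[5])` would then return the 28x28 image that the label describes, under the assumptions stated above.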

The PyTorch QMNIST loader

Update - The PyTorch QMNIST loader described here is now included in torchvision.

File qmnist.py contains a QMNIST data loader for the popular PyTorch platform. It either loads the QMNIST data files provided in the same directory as the file pytorch.py or downloads them from the web when passing the option download=True. This data loader is compatible with the standard PyTorch MNIST data loader and also provides additional features whose documentation is best found in the comments located inside pytorch.py.

Here are a couple of examples:

from qmnist import QMNIST

# the qmnist training set, download from the web if not found
qtrain = QMNIST('_qmnist', train=True, download=True)

# the qmnist testing set, do not download.
qtest = QMNIST('_qmnist', train=False)

# the first 10k of the qmnist testing set with extended labels
# (targets are a torch vector of 8 integers)
qtest10k = QMNIST('_qmnist', what='test10k', compat=False, download=True)

# all the NIST digits with extended labels
qall = QMNIST('_qmnist', what='nist', compat=False)

Citation

Please use the following citation when referencing the dataset:

@incollection{qmnist-2019,
   title = "Cold Case: The Lost MNIST Digits",
   author = "Chhavi Yadav and L\'{e}on Bottou",
   booktitle = {Advances in Neural Information Processing Systems 32},
   year = {2019},
   publisher = {Curran Associates, Inc.},
}
