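Public Dataset

Large-scale outdoor Chinese character OCR annotation dataset, containing about 1 million Chinese characters from 3,850 unique characters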

Size: 36.23 GB


Action/Event Detection Classification


README.md

In this paper, we introduce a very large Chinese text dataset in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, the detection and recognition of text in natural images is still a challenging problem, especially for some more complicated character sets such as Chinese text. Lack of training data has always been a problem, especially for deep learning methods which require massive training data. In this paper, we provide details of a newly created dataset of Chinese text with about 1 million Chinese characters from 3850 unique ones annotated by experts in over 30000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.

1. 32,285 high resolution images

2. 1,018,402 character instances

3. 3,850 character categories

4. 6 kinds of attributes
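As a rough sense of scale, the counts above imply the following averages. This is a minimal arithmetic sketch using only the figures listed; the variable names are illustrative, and the per-category figure is a plain mean that says nothing about how instances are actually distributed across categories.

```python
# Figures taken from the list above.
num_images = 32_285
num_instances = 1_018_402
num_categories = 3_850

# Simple means; the real distribution per image or per category may be very uneven.
chars_per_image = num_instances / num_images
instances_per_category = num_instances / num_categories

print(f"{chars_per_image:.1f} character instances per image on average")
print(f"{instances_per_category:.1f} instances per character category on average")
```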

Evaluation Server

The evaluation server is available on CodaLab.
You should submit a .zip file that contains one .jsonl file in its top-level directory. Submission formats and evaluation metrics for the classification and detection tasks are described in tutorial part-2 and part-3, respectively.
Sample submissions can be downloaded from the "public submissions" of the corresponding competition on CodaLab. You may need to log in to CodaLab before downloading.
Detailed results are provided in the "view detailed results" link for each submission.
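The packaging requirement above (one .jsonl file at the top level of a .zip) could be met with something like the sketch below. It is not the official submission tool: the function names `pack_submission` and `check_submission`, the file name `predictions.jsonl`, and the record fields are hypothetical placeholders. The real per-line schema is defined in tutorial part-2 and part-3.

```python
import json
import os
import tempfile
import zipfile


def pack_submission(records, zip_path, jsonl_name="predictions.jsonl"):
    """Write records as JSON Lines and store them at the top level of a zip.

    `records` is any iterable of JSON-serializable objects. Their fields are
    placeholders here; use the schema from the tutorials for a real submission.
    """
    # One JSON object per line; ensure_ascii=False keeps Chinese characters readable.
    payload = "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # The archive name has no directory component, so the file sits at the top level.
        zf.writestr(jsonl_name, payload.encode("utf-8"))
    return zip_path


def check_submission(zip_path):
    """Return the name of the single top-level .jsonl file, or raise ValueError."""
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    top_level = [n for n in names if "/" not in n and n.endswith(".jsonl")]
    if len(top_level) != 1:
        raise ValueError(f"expected one top-level .jsonl file, found {top_level}")
    return top_level[0]


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    demo = [{"id": 1, "text": "中"}, {"id": 2, "text": "文"}]
    with tempfile.TemporaryDirectory() as tmp:
        path = pack_submission(demo, os.path.join(tmp, "submission.zip"))
        print(check_submission(path))
```

Running a check like `check_submission` before uploading catches the common mistake of zipping a folder, which would place the .jsonl file one level too deep.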

Contact

If you have any questions about the dataset or code, please contact Tai-Ling Yuan (yuantailing[at]gmail.com).

BibTeX:

@article{yuan2019ctw,
  author  = {Tai{-}Ling Yuan and Zhe Zhu and Kun Xu and Cheng{-}Jun Li and Tai{-}Jiang Mu and Shi{-}Min Hu},
  title   = {A Large Chinese Text Dataset in the Wild},
  journal = {Journal of Computer Science and Technology},
  volume  = {34},
  number  = {3},
  pages   = {509--521},
  year    = {2019},
}

Change Log

06/17/2019 (GMT+8): paper replaced with "A Large Chinese Text Dataset in the Wild"
07/04/2018 (GMT+8): dataset moved to OneDrive
03/17/2018 (GMT+8): evaluation server available
03/15/2018 (GMT+8): dataset released on WeiYun and Google Drive
02/28/2018 (GMT+8): website comes online

Terms of Use

The public annotations and trained models belong to the CSCG Group and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The images belong to Tencent Ltd. and are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Most of the baseline code belongs to Tai-Ling Yuan and is licensed under the MIT License.


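Data Usage Statement:

I. Data source and display:

1. This data was collected from the internet or supplied by a service provider. This platform lets users view and browse the dataset.
2. This platform only displays basic information about datasets, including but not limited to file types such as images, text, video, and audio.
3. Basic dataset information comes from the original data address or from the data provider. If the dataset description differs, the original data address or the provider's original address prevails.

II. Ownership:

1. Copyright in all datasets on this site belongs to the original data publisher or data provider.

III. Reposting:

1. If you need to repost data from this site, keep the original data address and the related copyright notices.

IV. Infringement and handling:

1. If any data displayed on this site involves infringement, please contact the site promptly and we will arrange to take the data offline.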
