公开数据集
数据结构 ? 2.03M
Data Structure ?
* 以上分析是由系统提取分析形成的结果,具体实际数据为准。
README.md
F. Prat(*), M. J. Castro(+), D. Llorens(*), A. Marzal(*), and J. M. Vilar(*)
* Departamento de Lenguajes y Sistemas Informáticos
Universitat Jaume I (UJI), 12071 Castellón, SPAIN
+ Departamento de Sistemas Informáticos y Computación
Universidad Politécnica de Valencia (UPV), 46071 Valencia, SPAIN
fprat '@' lsi.uji.es
December 2008
Data Set Information:
We have created the UJIpenchars2 character database by collecting samples from 60 writers at two different sites in two phases:
1st phase, 11 writers, carried out at UJI.
2nd phase, 49 writers, carried out at UPV (44 writers) and UJI (5).
Each writer contributed with letters, digits, and other characters and two samples were collected for each pair writer/character.
The complete lexicon is as follows:
66 letters (33 per case):
Letter n with tilde (2 characters).
Vowels with acute accent (10 characters).
Letter u with diaeresis (2 characters).
The 52 ASCII letters.
The 14 Spanish non-ASCII letters:
The 10 digits.
Other 21 characters:
Inverted question and exclamation marks (2 characters).
Masculine and feminine ordinal indicators (2 characters).
The euro sign (1 character).
The 16 ASCII ones shown in the following line:
. , ; : ? ! ' ' ( ) % - @ $ < >5 non-ASCII ones:
So the total number of samples in this database is 11640: 60 writers x (66+10+21) characters x 2 repetitions
UJIpenchars is a subset of UJIpenchars2 with only 1364 samples: the ASCII letters and digits collected at UJI during the 1st acquisition phase.
We have not defined a standard task for UJIpenchars2, but divided the writer set into two disjoint subsets in order to ease the definition of writer independent tasks:
40 'trn' writers:
The 11 1st phase UJI writers.
29 UPV writers.
20 'tst' writers:
The 5 2nd phase UJI writers.
15 UPV writers.
The distribution of our database consists of 2 files:
This 'uji2.names'.
The file 'ujipenchars2.txt' containing all the samples in a format described later.
The handwriting samples were collected on a Toshiba Portégé M400 Tablet PC using its cordless stylus. Each one of the 60 writers completed 2 non-consecutive sessions. In each session, the corresponding writer was asked to write one exemplar for each character in the lexicon. The acquisition program shows a set of boxes on the screen, one for each required character, and writers are told to write only inside those boxes.
Each acquisition box is approximately 13.6 millimetres wide and 20.4 millimetres tall and contais two horizontal guides at approximate distances of 7.5 and 12.7 millimetres from top, respectively.
Writers were instructed to clear the content of the corresponding box by using an on-screen button and try again whenever they made a mistake or were unhappy with the writing of any character. Subjects were monitored only when writing their first exemplars and every sample considered OK by its writer was accepted, even if some of its points lay out of the corresponding acquisition box.
only X and Y coordinate information was recorded along the strokes by the acquisition program, without, for instance, pressure level values or timing information. Thus, in multi-stroke samples, no information at all was recorded between strokes. Both coordinates were expressed as integer ink units, with the origin lying at the top left corner of the corresponding acquisition box. X values grow left-to-right and Y values grow downwards.
Although we have employed the same acquisition program on identical hardware at UJI and UPV, we have observed that acquisition files seem to show that UPV samples have been collected using acquisition boxes larger than UJI ones. This is due to a different configuration parameter value that, at UPV, makes the acquisition program translate 1 millimetre into 152 ink units, instead of using the standard UJI ratio: 100 ink units per millimetre. If box homogenisation is needed, it can be easily achieved, for instance, by dividing UPV coordinate values by 1.52.
We have also observed that runs of consecutive points with identical coordinates were frequently acquired inside strokes; such runs were preserved in this database, so it is up to its users to decide whether to avoid them by an appropriate preprocessing step or not.
Although it is a paper mainly devoted to UJIpenchars,
D. Llorens et al.: 'The UJIpenchars Database: A Pen-based Database of Isolated Handwritten Characters'
Proc. of the 6th International Conference on Language Resources and evaluation. 2008.
contains useful information about UJIpenchars2. It can be found in [Web link].
Attribute Information:
The file 'ujipenchars2.txt' is a text one with a simple format where all database samples are represented. Because some non-ASCII characters are needed, UTF-8 encoding is employed.
In order to describe how attributes are represented in 'ujipenchars2.txt', it is worth explaining the general syntax of the file first. From the higher-level point of view, this file is composed of comment lines and sample representations.
A comment line is one beginning with two slashes. In 'ujipenchars2.txt', we have employed comment lines for two purposes:
Prior to the set of samples corresponding to each site, a comment acts as a reminder of the number of ink units per length unit on the Tablet PC screen, so these two comments can be found in 'ujipenchars2.txt':
// UJI: 100 units per millimetre // UPV: 152 units per millimetre
Prior to each sample representation, an ASCII comment tells you which character it represents. For ASCII characters (for instance, an uppercase u), comments may look like this:
// ASCII char: U
For non-ASCII characters (for instance, a lowercase o with acute accent), the character identity is represented by means of its HTML entity name:// Non-ASCII char: oacute
A sample representation is composed of a header line followed by the representation of its *sequence of strokes*, where the header line consists of three blank-separated elements: the word 'WORD', the representation of the *character identity*, and the *session identifier*. For instance, a semicolon sample representation may look like this:
WORD ; trn_UJI_W03-01 NUMSTROKES 2 POINTS 9 # 541 1001 541 1001 540 987 540 987 530 977 530 977 530 977 530 977 530 977 POINTS 8 # 518 1227 500 1257 480 1291 470 1309 465 1318 458 1330 458 1330 471 1312
A detailed description on how information about each attribute is represented in 'ujipenchars2.txt' follows:
Character identity: It is represented by the character itself (';' in the previous example), one out of 97 possibilities. Remember that UTF-8 encoding is employed, so non-ASCII characters need more than a byte to be encoded.
Session identifier: It is composed by a long writer identifier ('trn_UJI_W03' in the previous example) and a repetition number ('01' or '02') separated by a hyphen, where a long writer identifier consists of three elements separated by underscores:
A writer set identifier, 'trn' (writers for training) or 'tst' (writers for test).
A site identifier, 'UJI' or 'UPV'.
A short writer identifier, like 'W03' in the previous example. Writers
帕依提提提温馨提示
该数据集正在整理中,为您准备了其他渠道,请您使用
- 分享你的想法
全部内容
数据使用声明:
- 1、该数据来自于互联网数据采集或服务商的提供,本平台为用户提供数据集的展示与浏览。
- 2、本平台仅作为数据集的基本信息展示、包括但不限于图像、文本、视频、音频等文件类型。
- 3、数据集基本信息来自数据原地址或数据提供方提供的信息,如数据集描述中有描述差异,请以数据原地址或服务商原地址为准。
- 1、本站中的所有数据集的版权都归属于原数据发布者或数据提供方所有。
- 1、如您需要转载本站数据,请保留原数据地址及相关版权声明。
- 1、如本站中的部分数据涉及侵权展示,请及时联系本站,我们会安排进行数据下线。