TESSBOXES
NAME
tessboxes - draw and edit tesseract box information.
SYNOPSIS
tessboxes [-e] PBMFILE BOXFILE [-g XY] [-s FACTOR] [-2] [-m MAGNIFY] [-a ANNOT]
tessboxes -V
DESCRIPTION
tessboxes
reads a (black and white) PBM (image) file and a
corresponding box file suitable for training the
tesseract(1) OCR tool to recognize new languages and
character sets. tessboxes draws the boxes on the
image, and can be used to interactively edit the box
file.
When the -e
switch is missing, tessboxes writes a (colour) ppm(5)
file to STDOUT which has the boxes overlaid on the original
image. This is intended as a simple tool that can be used as
a component of a more comprehensive training process, and
the input and output formats are deliberately chosen to be
as simple as possible.
% tessboxes image.pbm boxes | pnmtopng > image_with_boxes.png
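To produce overlays for a whole directory, the pipeline above can be wrapped in a loop. This is a minimal sketch, assuming each image.pbm has a matching image.box next to it and that pnmtopng is installed. The function names and the _with_boxes.png suffix are illustrative and not part of tessboxes.

```shell
# Hypothetical helper: derive the output name for a given PBM file.
overlay_name() {
    printf '%s_with_boxes.png\n' "${1%.pbm}"
}

# Hypothetical helper: render an overlay for every *.pbm in a directory
# that has a matching .box file. Images without a box file are skipped.
overlay_all() {
    dir=${1:-.}
    for pbm in "$dir"/*.pbm; do
        [ -e "$pbm" ] || continue          # the glob matched nothing
        box="${pbm%.pbm}.box"
        if [ ! -e "$box" ]; then
            echo "skipping $pbm: no box file" >&2
            continue
        fi
        tessboxes "$pbm" "$box" | pnmtopng > "$(overlay_name "$pbm")"
    done
}
```

For example, "overlay_all scans" would write scans/page1_with_boxes.png for scans/page1.pbm and scans/page1.box.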
When the -e
switch is used, tessboxes becomes an interactive
editor for the BOXFILE. The terminal shows a list of
labelled boxes, while the corresponding bitmap is shown in a
separate X11 window. Typing one or more ordinary keys
replaces the label of the current box. Typing Ctrl+F cycles
through faster editing modes for bulk processing, where the
cursor moves automatically to the next box, ENTER/SPACE
moves forward in some modes, and BACKSPACE moves
backward.
The following
special keys are recognized, and do not change the current
label.
Ctrl+x           quit editor and save BOXFILE
ESC              quit editor but do NOT save BOXFILE
arrow keys       select a new box in the list
Ins              insert a new (blank) box just before the current box
Del              delete the current box
Ctrl+arrow keys  grow or shrink the currently selected box
Alt+arrow keys   move the currently selected box keeping its size
Alt+s            shrink the currently selected box
Alt+c            crop the currently selected box (can use repeatedly)
Alt+a            crop ALL the boxes in the image at once
Ctrl+F           cycle through fast(er) editing modes
Ctrl+A           toggle append/overwrite mode (default is overwrite)
Ctrl+Z           cycle through magnification factors up to MAGNIFY
F1-F8            annotate symbol with a predefined string
EXIT_STATUS
tessboxes
returns zero on success, nonzero if an error occurs.
TRAINING_LANGUAGES
The purpose of
tessboxes is to make training tesseract less painful. The
following description is intended to conveniently summarize
the various steps, as they apply to tesseract v2.03. More
comprehensive information can be found here:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract
In normal
usage, tesseract can read the text in an image as
follows:
% tesseract input.tif output [-l lang]
This command
produces a file output.txt which contains any recognized
text. The optional -l switch names the language to use. All
available languages can be found in a directory named tessdata
somewhere on the system. Due to limitations in tesseract,
the input image must be a black and white TIFF file without
an alpha layer etc., otherwise output.txt will be blank.
The simplest way to ensure the image can be recognized is to
convert it to PBM format, and then convert it back to TIFF,
as follows:
% convert input.tif new_input.pbm
% convert new_input.pbm new_input.tif
tessboxes reads
a pbm(5) file for simplicity (which can optionally be
gzipped), so you will need to do this anyway. If you have a
directory full of input files, this can be done in bash(1)
as follows:
% for f in *.tif; do
    convert "$f" "new_${f%.tif}.pbm";
    convert "new_${f%.tif}.pbm" "new_$f";
  done
To train a new
language, you must compile a set of language files. Suppose
that image.tif is a sample image. First create a boxfile:
% tesseract image.tif boxes batch.nochop makebox
This creates a
file boxes.txt with the coordinates of boxes surrounding the
characters in the image. Due to limitations in tesseract, it
is a good idea to rename the file to match the image name,
with a .box extension:
% mv boxes.txt image.box
You can edit
the boxfile with tessboxes as follows:
% tessboxes -e image.pbm image.box
You should
check that each box surrounds a character properly, and has
the correct label for the character. (This is tedious.)
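Part of this check can be automated before opening the editor. The sketch below assumes the usual box file layout: one box per line, a label followed by the left, bottom, right and top pixel coordinates, and a page number in the v3 format. The function name check_boxes is hypothetical. It only flags malformed or degenerate boxes; whether a label is correct still has to be checked by eye.

```shell
# Hypothetical helper: report lines that do not look like valid boxes.
# Assumed line layout: LABEL LEFT BOTTOM RIGHT TOP [PAGE]
# Returns nonzero if any problem was found.
check_boxes() {
    awk '
        NF < 5 {
            printf "%s:%d: too few fields\n", FILENAME, FNR
            bad = 1
            next
        }
        # A usable box needs left < right and bottom < top.
        $2 + 0 >= $4 + 0 || $3 + 0 >= $5 + 0 {
            printf "%s:%d: empty or inverted box for \"%s\"\n", FILENAME, FNR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$@"
}
```

Running "check_boxes image.box" prints one line per suspicious box, with the line number in the box file.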
Once you have
created a few boxfiles, it remains to compile them into a
tesseract language. Here is the first step in bash(1):
% for f in *.box; do
    tesseract "${f%.box}.tif" junk nobatch box.train
  done
In this
command, tesseract expects the TIFF image name, and will
find the corresponding boxfiles itself, which is why we had
to rename them earlier. For each boxfile, if the command was
successful, then you should now have a file with the same
name and a .tr extension (i.e. you now have image.tif,
image.pbm, image.box, image.tr).
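A quick way to spot box files for which no .tr file was produced is sketched below. The function name missing_tr is hypothetical, and it assumes the .tr files sit in the same directory as the box files.

```shell
# Hypothetical helper: list box files in a directory that have no
# matching .tr file next to them.
missing_tr() {
    dir=${1:-.}
    for box in "$dir"/*.box; do
        [ -e "$box" ] || continue          # the glob matched nothing
        [ -e "${box%.box}.tr" ] || printf '%s\n' "$box"
    done
}
```

An empty result from "missing_tr ." means every box file in the current directory has a .tr file.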
You should
watch out for error messages which indicate FAILURE or
FATALITY. These messages can occur when boxes overlap, for
example, and may indicate unprocessable data. In the worst
case, tesseract will perhaps not create a .tr file at all.
In a FAILURE, a box may be ignored, whereas FATALITY or
REBALANCE REQD occurs when tesseract has fewer than 3 sample
boxes for some character.
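Labels with too few samples can be found ahead of time by counting them. This is a rough sketch, assuming the label is the first whitespace-separated field of each box file line; the function name rare_labels is hypothetical. It counts across all the box files passed to it.

```shell
# Hypothetical helper: print "COUNT LABEL" for every label that occurs
# fewer than 3 times across the given box files.
rare_labels() {
    awk 'NF >= 5 { count[$1]++ }
         END { for (c in count) if (count[c] < 3) print count[c], c }' "$@" | sort
}
```

For example, "rare_labels *.box" lists the characters that need more sample boxes before training.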
The easiest way
to fix these types of problems is to delete a box, or to
change its coordinates. The -g switch can be used to go
directly to such problem boxes. Just cut and paste the
coordinates as given in a FAILURE message, for example:
% tessboxes -e image.pbm image.box -g 1871,1154
When you have
enough *.tr files, you can compile the remaining language
files as follows:
% mftraining *.tr
% cntraining *.tr
% unicharset_extractor *.box
It may be a
good idea to combine several *.tr files if they represent
the same typeface. In that case, do the following (the order
of the files must be identical in all commands):
% cat image1.tr image2.tr > combined.tr
% cat image1.box image2.box > combined.box
% mftraining combined.tr
% cntraining combined.tr
% unicharset_extractor combined.box
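Since the file order must match, it can help to build both combined files from a single list of base names. The following is a sketch only; the function name combine_training is hypothetical, and it writes combined.tr and combined.box in the current directory.

```shell
# Hypothetical helper: combine_training NAME...
# Concatenates NAME.tr and NAME.box for each NAME, in the same order,
# into combined.tr and combined.box. Stops if any input is missing.
combine_training() {
    : > combined.tr
    : > combined.box
    for name in "$@"; do
        if [ ! -e "$name.tr" ] || [ ! -e "$name.box" ]; then
            echo "missing $name.tr or $name.box" >&2
            return 1
        fi
        cat "$name.tr"  >> combined.tr
        cat "$name.box" >> combined.box
    done
}
```

After "combine_training image1 image2" succeeds, run mftraining, cntraining and unicharset_extractor on the combined files as shown above.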
Now choose a
name for your language, e.g. "mylang". Due to
limitations in tesseract, all the compiled language files
must be named mylang.* and must reside in a directory called
tessdata. Therefore:
% mkdir tessdata
% mv inttemp tessdata/mylang.inttemp
% mv normproto tessdata/mylang.normproto
% mv pffmtable tessdata/mylang.pffmtable
% mv unicharset tessdata/mylang.unicharset
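The same renaming can be written as a loop, which avoids typing the language name four times. This is a sketch; the function name install_lang is hypothetical, and it expects the four files produced by the training commands to be in the current directory.

```shell
# Hypothetical helper: install_lang NAME
# Moves the four compiled files into ./tessdata as NAME.<file>.
install_lang() {
    name=$1
    mkdir -p tessdata
    for part in inttemp normproto pffmtable unicharset; do
        mv "$part" "tessdata/$name.$part" || return 1
    done
}
```

For example, "install_lang mylang" has the same effect as the commands above.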
You still need
some extra files. If you’re training a variant of
English, then you can simply copy the tesseract system
files. Find your system tessdata directory. For example:
% cp /usr/share/tessdata/eng.DangAmbigs tessdata/mylang.DangAmbigs
% cp /usr/share/tessdata/eng.freq-dawg tessdata/mylang.freq-dawg
% cp /usr/share/tessdata/eng.word-dawg tessdata/mylang.word-dawg
% cp /usr/share/tessdata/eng.user-words tessdata/mylang.user-words
You are now
done. To read a new image file with the language
"mylang", try this:
% export TESSDATA_PREFIX=./tessdata/
% tesseract image.tif output -l mylang
If you
don’t want to set TESSDATA_PREFIX (never forget the
trailing /), you can also copy all the files
tessdata/mylang.* into the system tessdata directory you
found earlier.
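Because the trailing / is easy to forget, a script can add it when it is missing. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: print the argument, adding a trailing slash
# if it does not already end in one.
with_trailing_slash() {
    printf '%s/\n' "${1%/}"
}
```

It could be used as: export TESSDATA_PREFIX="$(with_trailing_slash ./tessdata)"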
OPTIONS
-e
    Edit the BOXFILE. This consists of an interactive editor in the
    current terminal, and a graphical window showing the boxes
    surrounding the letters. The window can be resized as convenient.
    In the editor, the highlight can be moved with the cursor keys,
    and anything typed will replace the box label. To move or resize
    the box, use ALT or CTRL with the cursor keys.

-g XY
    This also turns on the -e switch automatically. After loading the
    BOXFILE, go directly to the first box whose corner coordinates
    are XY. The string XY can be either "NUMBER,NUMBER" or
    "(NUMBER,NUMBER)". If the coordinates are not found, tessboxes
    exits immediately.

-s FACTOR
    In conjunction with the -e switch, shifts the highlighted box
    horizontally in the graphical display by a FACTOR in the range
    [0.01, 0.99].

-2
    Save box files in Tesseract v2 legacy format. Default is to save
    in current v3 format, which has an extra page number column at
    the end of each line.

-m MAGNIFY
    In conjunction with the -e switch, magnify the image by integer
    factor MAGNIFY. Use Ctrl+Z to cycle through the magnification
    factors, including the original size.

-a ANNOT
    Redefine annotation phrases. Expects a string ANNOT of up to 8
    phrases delimited by semicolons, e.g. ";bold;italic" associates
    the empty phrase with FN1, "bold" with FN2, "italic" with FN3
    and leaves FN4-FN8 at their defaults.
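To illustrate the difference between the two formats named under -2, the sketch below converts an existing v3 box file to the v2 layout by dropping everything after the fourth coordinate. It assumes the page number is the sixth whitespace-separated field; note that it also discards any annotation comments at the end of the line. The function name box_v3_to_v2 is hypothetical, and this is not how tessboxes itself performs the conversion.

```shell
# Hypothetical helper: keep only the label and the four coordinates
# of each box line. Reads the named files, or standard input if none.
box_v3_to_v2() {
    awk 'NF >= 5 { print $1, $2, $3, $4, $5 }' "$@"
}
```

For example: box_v3_to_v2 image.box > image_v2.box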
NOTES
The annotations
(keys FN1-FN8) are saved as comments at the end of each line
of the box file. This shouldn’t cause problems with
tesseract(1), since (at least in v3.x of tesseract) the
extra information is ignored. If the file format is ever
changed, this will become a bug.
BUGS
tessboxes uses
too much CPU when idle.
SOURCE
The source code
for the latest version of this program is available at the
following locations:
http://www.lbreyer.com/gpl.html
AUTHOR
Laird A. Breyer
<laird@lbreyer.com>
SEE ALSO
tesseract(1), pbm(5), ppm(5), convert(1)