Large automated OCR job?
Philipp Lenssen
2007-08-17 18:02:47 UTC
Hi! I have several tens of thousands of covers of all kinds (mostly
comic book covers) and would like to extract text from them for
coverbrowser.com. Results don't have to be perfect; it would already
be great if the software could find a keyword or two in every other
cover. I have the JPG files locally on a Windows machine. Do you know
of any software that I could feed a list of folders (or some text
files with paths...) and that would store the texts in some format I
can import back?

thanks!!
Ramon F Herrera
2007-08-28 17:37:04 UTC
Rule number one for OCR is to use TIFF, a format which was
designed *specifically* for this purpose. JPEG is for photographs.

-Ramon
F***@gmail.com
2007-08-31 02:08:07 UTC
I suggest using ABBYY Recognition Server. That is what I would use
(and have used at WiseTrend) on such projects. Drag and drop your
folders with sub-folders, and the software will re-create the
structure and export text results. Then you can query the text results,
or even XML, for your keywords. There are quite a few options to choose
from. It will pull out all visible text accurately if run in "OCR for
indexing" mode.
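
For anyone who wants to script the same folder-in, text-out workflow
themselves, here is a minimal Python sketch. It is not how ABBYY
Recognition Server works internally; it only illustrates walking a
folder tree, re-creating the structure under an output folder, and
writing one text file per image. The function name mirror_ocr and the
ocr callback are hypothetical: you would plug in whatever OCR engine
you actually have.

```python
import os
from pathlib import Path
from typing import Callable

# Assumption: these are the image types worth sending to OCR.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}


def mirror_ocr(src_root, dst_root, ocr: Callable[[Path], str]) -> int:
    """Run `ocr` on every image under src_root and write the text to the
    same relative path (with a .txt suffix) under dst_root.

    `ocr` is a placeholder for a real engine call. Returns the number of
    text files written.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    written = 0
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in sorted(filenames):
            src = Path(dirpath) / name
            if src.suffix.lower() not in IMAGE_SUFFIXES:
                continue  # skip anything that is not an image
            # Re-create the folder structure on the output side.
            dst = dst_root / src.relative_to(src_root).with_suffix(".txt")
            dst.parent.mkdir(parents=True, exist_ok=True)
            try:
                text = ocr(src)
            except Exception as exc:
                # One unreadable cover should not stop the whole batch;
                # an empty text file is written in its place.
                print(f"OCR failed for {src}: {exc}")
                text = ""
            dst.write_text(text, encoding="utf-8")
            written += 1
    return written
```

Calling mirror_ocr("covers", "covers_text", my_engine) would leave a
covers_text tree of .txt files that can be imported back by path.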

Let me know if you need more information.

ilya
Milind Joshi
2007-09-05 22:42:28 UTC
We have a software module within ixtract, called ixTexter, that does
this.

ixTexter is special in that internally, under the hood, there are
several engines, and depending on performance parameters, one or more
can be switched on/off, and results can be voted against each other and
against a custom dictionary if required. It also extracts keywords
automatically when producing searchable PDFs using several innovative
NL techniques. It outputs XML, CSV, and PDF (image over text, text over
image, text only, image only), and custom formats can be supported.
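
To give a rough idea of what voting between engines can look like,
here is a minimal Python sketch. It is an illustration only and not
ixTexter's actual algorithm: the function name vote_keywords, the
min_votes threshold, and the simple word-level tokenization are all
assumptions. A word is kept if enough engines agree on it, or if it
appears in the custom dictionary.

```python
import re
from collections import Counter


def vote_keywords(engine_outputs, dictionary=frozenset(), min_votes=2):
    """Combine the raw text from several OCR engines into a keyword list.

    A word is accepted when at least `min_votes` engines produced it, or
    when any engine produced it and it is in the custom `dictionary`.
    """
    votes = Counter()
    for text in engine_outputs:
        # Each engine gets one vote per distinct word it recognized.
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        votes.update(words)
    return sorted(
        word
        for word, count in votes.items()
        if count >= min_votes or word in dictionary
    )
```

With three engines that each misread a different word, the misreadings
get one vote each and drop out, while the words the engines agree on
survive.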

However, if the number of images is small, then we'd probably recommend
that a service bureau that uses ixTexter process these for you at
a fraction of the cost of licensing ixTexter, especially if this is a
one-time conversion.

Do you have samples? JPG would be fine for now.

Regards,
Milind Joshi

IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com/ocr_icr.html
