Fujitsu ScanSnap S510 with abbyy

Jonathan Berry

2007-11-14 23:59:18 UTC

I'm considering buying the Fujitsu ScanSnap S510 which has a, er,
special version of abbyy, plus Adobe 8.

Reviews of this machine are very good, but I also have a fair number
of non-sheet things to scan to OCR. I figure I'll use my flat bed
scanner, but then will the "special" version of Abbyy OCR from those
JPGs, or is it "hard-wired" to only work with the ScanSnap?

Does Abbyy have any programmability? I do some scanning of chess
notation, which can be constrained by straightforward regular
expressions. Other OCRs I tried wanted to make words out of things
like Qe5 Na2 and so on, even with the dictionary turned off.

Any other comments are welcome. TIA.

--
Jonathan Berry

s***@gmail.com

2007-11-15 06:57:45 UTC

Post by Jonathan Berry
I'm considering buying the Fujitsu ScanSnap S510 which has a, er,
special version of abbyy, plus Adobe 8.
Reviews of this machine are very good, but I also have a fair number
of non-sheet things to scan to OCR. I figure I'll use my flat bed
scanner, but then will the "special" version of Abbyy OCR from those
jpg's, or is it "hard-wired" to only work with the ScanSnap?
Does Abbyy have any programmability? I do some scanning of chess
notation, which can be constrained by straightforward regular
expressions. Other OCRs I tried wanted to make words out of things
like Qe5 Na2 and so on, even with the dictionary turned off.
Any other comments are welcome. TIA.
--
Jonathan Berry

Hi Jonathan,

ScanSnap doesn't use common TWAIN drivers, as normal flatbed scanners
do. Rather, it uses its own unique software to scan. That is why ABBYY
FineReader for ScanSnap is "hard-wired" to ScanSnap only, and it
cannot accept images from other scanners.

As for scanning chess notation, you should look at ABBYY FineReader
9.0. It allows you to define user languages based on regular
expressions. You can download a trial version from the ABBYY site and
see if it works for your images: http://finereader.abbyy.com/
Note that the user languages functionality is available only in the
retail version of ABBYY FineReader. The version for ScanSnap doesn't
include it.
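
To make the regular-expression idea concrete, here is a rough sketch in Python of a pattern that accepts typical algebraic chess moves such as Qe5 or Na2 and flags anything else on a line of OCR output. It is only an illustration: the pattern, the names SAN_MOVE, is_san_move and flag_suspect_tokens are assumptions made up for this example, and this is not the syntax FineReader uses for user languages.

```python
import re

# Hypothetical pattern for algebraic chess moves. This is an assumption
# for illustration only, not ABBYY's user-language syntax.
SAN_MOVE = re.compile(
    r"""
    (?:
        O-O(?:-O)?                          # castling, king side or queen side
      | [KQRBN][a-h]?[1-8]?x?[a-h][1-8]     # piece move, optional disambiguation/capture
      | (?:[a-h]x)?[a-h][1-8](?:=[QRBN])?   # pawn move, optional capture/promotion
    )
    [+#]?                                   # optional check or mate marker
    """,
    re.VERBOSE,
)


def is_san_move(token: str) -> bool:
    """Return True if the whole token looks like a single chess move."""
    return SAN_MOVE.fullmatch(token) is not None


def flag_suspect_tokens(ocr_line: str) -> list[str]:
    """Return the whitespace-separated tokens that do not look like moves."""
    return [tok for tok in ocr_line.split() if not is_san_move(tok)]
```

For example, flag_suspect_tokens("Qe5 Na2 Qes") would report only "Qes" as suspect. Move numbers such as "1." are not covered by this pattern and would be flagged as well, so a real filter would need to allow for them.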

Vera

Jonathan Berry

2007-12-06 23:59:47 UTC

Much truth, Vera, thank you.

I went ahead and bought the Fujitsu ScanSnap S510. So far I am mostly
pleased with the performance of the scanner and of the accompanying
software Adobe Acrobat 8.0 (which I have updated to 8.1.1) Standard.
I'm amazed at what you can do with Acrobat Standard, and perhaps I have
yet only scratched the surface.

I am a bit more iffy about the crippled Abbyy which is included in the
package. It does a pretty good job OCRing typed material, and it is
much better in recognizing page formats than I experienced in my last
foray, around a decade ago,
into the world of OCR (Pagis Pro software that was so nasty that I
uninstalled it within half an hour and resolved never again to buy one
of their products). However, Abbyy still makes too many errors in
recognizing the text of old newspaper columns. The RTF file which it
produces is difficult to edit (the print is too small for the screen
and it takes too much tweaking to get readable) and has lots of little
boxes that interfere with text flow. If you have hundreds of pages to
edit, you want to be able to get in quickly and do the necessary
corrections. The Abbyy Scan2Excel was unable to make sense of phone
bills. In many pages of bills, it was able to put the figures in
columns only three times, and on each occasion it did that job
differently. These were all in the same file and bills from the same
telephone company.

There are (almost) no user settings to adjust, and no real-time
controls.

Admittedly, OCR is a more difficult task than the jobs which Acrobat
is called upon to do. And the Abbyy software with the ScanSnap is
based upon FineReader 7, while the current version is 9. The full
version 9 alone would cost more than the scanner with software.
Still, I don't have to be pleased with every bargain.

The bottom line is that I end up converting most documents to
text-searchable PDF files using Acrobat (it takes some moments of CPU
time, but it makes the resulting files smaller), and rarely do I use
the full Abbyy OCR.

--
Jonathan Berry

Milind Joshi

2007-12-07 17:14:13 UTC

Post by Jonathan Berry
Much truth, Vera, thank you.
I went ahead and bought the Fujitsu ScanSnap S510. So far I am mostly
pleased with the performance of the scanner and of the accompanying
software Adobe Acrobat 8.0 (which I have updated to 8.1.1) Standard.
I'm amazed at what you can do with Acrobat Standard, and perhaps I have
yet only scratched the surface.
I am a bit more iffy about the crippled Abbyy which is included in the
package. It does a pretty good job OCRing typed material, and it is
much better in recognizing page formats than I experienced in my last
foray, around a decade ago,
into the world of OCR (Pagis Pro software that was so nasty that I
uninstalled it within half an hour and resolved never again to buy one
of their products). However, Abbyy still makes too many errors in
recognizing the text of old newspaper columns. The RTF file which it
produces is difficult to edit (the print is too small for the screen
and it takes too much tweaking to get readable) and has lots of little
boxes that interfere with text flow. If you have hundreds of pages to
edit, you want to be able to get in quickly and do the necessary
corrections. The Abbyy Scan2Excel was unable to make sense of phone
bills. In many pages of bills, it was able to put the figures in
columns only three times, and on each occasion it did that job
differently. These were all in the same file and bills from the same
telephone company.
There are (almost) no user settings to adjust, and no real-time
controls.
Admittedly, OCR is a more difficult task than the jobs which Acrobat
is called upon to do. And the Abbyy software with the ScanSnap is
based upon FineReader 7, while the current version is 9. The full
version 9 alone would cost more than the scanner with software.
Still, I don't have to be pleased with every bargain.
The bottom line is that I end up converting most documents to text-
searchable (it takes some moments of CPU time, but it makes the
resulting files smaller) PDF files using Acrobat, and rarely do I use
the full Abbyy OCR.
--
Jonathan Berry

Yes, it is well known that such bundled software is not a good fit for
special fonts, degraded text material, and handprint.

If you have anything more than 100 pages a day to scan, and you want
to do special things with them that enhance your productivity, you
need applications that go beyond what such "bundled" apps can provide;
that is not their focus anyway. The way it works for scanner
manufacturers is that they want to give as many people as possible a
reason to buy their product, so they strike special deals with
companies like ABBYY to "bundle" scaled-down/crippled versions of
software to give away with their scanners. For the software
manufacturer, the payoff is brand awareness and the fact that you have
tried out their technology and (hopefully) been impressed.

Beyond a certain point, even the Fujitsu ScanSnap begins to hamper
productivity... can you believe that there are document scanners that
cost hundreds of thousands of dollars and even millions of dollars?
Right, there's a range for (almost) every imaginable need.

ABBYY, for example, makes their "engine" available to developers, and
they also have higher-end retail products with labels like
"professional" or "enterprise".

Many other manufacturers do similar things, depending on which model
of scanner you buy.

That is how ABBYY and other OCR/ICR manufacturers earn money... :-)

Best Regards,
Milind Joshi
IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com/ocr_icr.html

Jonathan Berry

2007-12-12 06:56:14 UTC

Post by Milind Joshi
Yes, it is well-known that for special fonts, degraded text material,
and handprint, such bundled software is not a good fit.

You are right!
Easy for humans to read is not necessarily the ticket for
machine-based recognition.

Post by Milind Joshi
If you have anything more than 100 pages a day to scan,

I have, let's say, 2 or 3 thousand pages to scan, eventually, but the
number of pages is not increasing.

Post by Milind Joshi
and you want
to do special things with those that enhance your productivity, you
need applications that go beyond what such "bundled" apps can provide,
and that is not their focus too... the way it works for scanner
manufacturers - they want to give as many people as possible a reason
to buy their product. So they strike special deals with companies like
ABBYY to "bundle" scaled-down/crippled versions of software to give
away with their scanners. For the software manufacturer, this is brand
awareness and the fact that you have tried out their technology and
(hopefully) been impressed.

Yes.

If that had been my experience, I would not have bothered to post it.
The ABBYY products performed less well than I would have hoped. After
a bad experience more than a decade ago with the other major OCR
purveyor, today I'm inclined to scan documents into Adobe, make them
text searchable, and leave it at that. If somebody comes up with
better OCR in a decade, I (or my successor) can scan from the PDFs.

Post by Milind Joshi
Beyond a certain point, even the fujitsu scansnap begins to hamper
productivity... can you believe that there are document scanners that
cost hundreds of thousands of dollars and even millions of dollars?

Yes.

Post by Milind Joshi
Right, there's a range for every (mostly) imaginable need.
ABBYY, for example, makes their "engine" available to developers, and
they also have higher-end retail products labelled like "professional/
enterprise".

Having looked at their lowest offering, I am not tempted to get
on the ladder.

Post by Milind Joshi
Many other manufacturers do similar things, depending on which model
of scanner you buy.
That is how ABBYY and other OCR/ICR manufacturers earn money... :-)

Or, if it doesn't work, how Adobe makes even more money.

Post by Milind Joshi
Best Regards,
Milind Joshi
IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com/ocr_icr.html

--
Jonathan Berry

Milind Joshi

2007-12-12 20:53:07 UTC

Hi Jonathan,

Very interesting responses.
I am intrigued. Do you have sample images for us to play with?

Post by Jonathan Berry
I have let's say 2 or 3 thousand pages to scan, eventually, but the
number of pages is not increasing.

Is that a one-time 2-3 thousand or 2-3 thousand per day?
If it is one-time, your best bet is to get someone to type it out.

Post by Jonathan Berry
If that had been my experience, I would not have bothered to post it.
The ABBYY products performed less well than I would have hoped. After
a bad experience more than a decade ago with the other major OCR
purveyor, today I'm inclined to scan documents into Adobe, make them
text searchable, and leave it at that. If somebody comes up with
better OCR in a decade, I (or my successor) can scan from the PDFs.

That is quite an interesting approach, and a lot of people do this
already, but remember that not all PDFs are created equal.

Also, the image quality, which depends on the quality of the light
(fluorescent vs. LED vs. filament lamp), the quality of the lens, and
the quality of the image compression algorithm, is very important. Bad
images today could mean bad OCR results forever, or at least for the
foreseeable future.

For example, many PDF converters downscale the original scan to
72 DPI or thereabouts because screen and print rendering don't
need much more than that resolution. You can forget about getting good
OCR results with an image that is downscaled...
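
To put rough numbers on the downscaling point, here is a small Python sketch of the arithmetic. It relies only on the definition 1 pt = 1/72 inch. The 300 DPI figure, the 8.5 x 11 inch page, the 10 pt type size and the function names are all assumptions chosen for illustration, not values taken from any particular scanner or PDF converter.

```python
def type_height_px(point_size: float, dpi: float) -> float:
    """Nominal height in pixels of type of a given point size (1 pt = 1/72 inch)."""
    return point_size / 72.0 * dpi


def page_pixels(width_in: float, height_in: float, dpi: float) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size in inches at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Compare a downscaled image with an assumed 300 DPI scan of the same page.
for dpi in (72, 300):
    width, height = page_pixels(8.5, 11.0, dpi)
    print(f"{dpi} DPI: page {width} x {height} px, "
          f"10 pt type about {type_height_px(10, dpi):.0f} px tall")
```

At 72 DPI a line of 10 pt type is only about 10 pixels tall, against about 42 pixels at 300 DPI, so the recognizer has far less detail to work with for each character.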

If you want to save them as PDF, then make sure the original image
(behind which the recognized text sits) stays at a high enough
resolution.

If you want to use a next-generation OCR, you would do well to leave
the original images intact, in addition to making the current-quality
PDFs.

Best Regards,
Milind Joshi
IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com/ocr_icr.html

Don

2007-12-21 21:44:11 UTC

Post by s***@gmail.com
Hi Jonathan,
Very interesting responses.
I am intrigued. Do you have sample images for us to play with?

Post by Jonathan Berry
I have let's say 2 or 3 thousand pages to scan, eventually, but the
number of pages is not increasing.

Is that a one-time 2-3 thousand or 2-3 thousand per day?
If it is one-time, your best bet is to get someone to type it out.

That is quite an interesting approach, a lot of people do this
already, but remember that not all PDFs are created equal.
Also, the image quality, which depends on the quality of the light
(fluorescent vs. LED vs. filament lamp), quality of the lens, and the
quality of the image compression algorithm is very important. Bad
images today could mean bad OCR results for ever, or at least the
foreseeable future.
For example, many PDF convertors, downscale the DPI of the original
scan to 72DPI or thereabout because screen and print rendering don't
need much more than that resolution. You can forget about getting good
OCR results with an image that is downscaled...
If you want to save them as PDF, then make sure the original image
(behind which the recognized text is), stays at a high enough
resolution.
If you want to use a next-generation OCR, you would do well to leave
the original images intact, in addition to making the current-quality
PDFs.
Best Regards,
Milind Joshi
IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com/ocr_icr.html

Tested the new ABBYY 9.0 under the trial.
The program is superb, and the newest feature, scanning PDFs, is even
more superb.

After scanning and OCRing tens of thousands of pages, I have found that
the best improvement to any OCR work is a valid dictionary that is
relevant to the materials you're OCRing.
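
As a loose illustration of the dictionary point, the following Python sketch snaps OCR output tokens to the closest word in a domain word list using the standard library's difflib. This is post-correction of text after recognition, not the dictionary mechanism inside FineReader or any other OCR engine. The word list, the cutoff value and the function name are assumptions made up for the example.

```python
import difflib

# Hypothetical domain vocabulary. In practice it would be built from
# words typical of the material being scanned.
DOMAIN_WORDS = ["tournament", "gambit", "endgame", "zugzwang", "fianchetto"]


def suggest_correction(token: str, vocabulary: list[str], cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the token unchanged if none is close."""
    matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token
```

With this list, a misread such as "tournarnent" would be mapped to "tournament", while a token with no close match, such as "Qe5", is left alone. A word list that does not match the material would either miss such errors or "correct" valid tokens into the wrong words, which is why relevance matters.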