
splitbrain.org - electronic brain surgery since 2001

Linux OCR Software Comparison

Over the last few weeks I spent some time researching the available OCR (Optical Character Recognition) tools for Linux.

I wanted to see how recognition rates differ between the tools and created some very simple test images. I took the last stanza of Edgar Allan Poe's “The Raven” and rendered it into an image in several different fonts. To make it a tiny bit more complicated, I also created a grayscale version of each image with lower contrast.

This is the original text:

And the raven, never flitting, still is sitting, still is sitting
On the pallid bust of Pallas just above my chamber door;
And his eyes have all the seeming of a demon's that is dreaming,
And the lamp-light o'er him streaming throws his shadow on the floor;
And my soul from out that shadow that lies floating on the floor
Shall be lifted - nevermore!

And this is what the resulting images looked like:

They all have 300 dpi, the text isn't distorted or arranged in multiple columns, the language is English in plain 7-bit ASCII and there is no image noise at all. Okay, the “Justy” font isn't your everyday printed font, but it resembles really clean handwriting. Overall this is a really basic task for OCR. Or so I thought.

Let's have a look at the results first:


                 abbyyocr       cuneiform      gocr           ocrad          tesseract
License          Proprietary    BSD            GPL2           GPL3           Apache 2.0
Version          8.0            0.9.0          0.48           0.19           SVN r402
Input format     PNG 1)         PNM            PNM            PNM            TIF 2)

Recognition rates and time spent:

                 abbyyocr       cuneiform      gocr           ocrad          tesseract
courier/black    100% (2.92s)   61% (1.11s)    67% (0.09s)    21% (0.02s)    81% (0.63s)
courier/gray     100% (2.85s)   no             67% (0.09s)    21% (0.03s)    81% (0.63s)
justy/black      11% (3.62s)    3% (1.14s)     31% (0.11s)    1% (0.02s)     15% (0.61s)
justy/gray       14% (3.45s)    no             31% (0.10s)    1% (0.02s)     15% (0.60s)
times/black      100% (2.80s)   96% (1.07s)    76% (0.16s)    82% (0.03s)    92% (0.74s)
times/gray       100% (2.87s)   no             76% (0.16s)    82% (0.03s)    92% (0.74s)
verdana/black    100% (2.90s)   95% (1.07s)    98% (0.10s)    98% (0.03s)    98% (0.45s)
verdana/gray     100% (2.85s)   no             98% (0.10s)    98% (0.02s)    98% (0.46s)

Recognition scores were calculated from dwdiff's statistics output, comparing the original text with the OCR output.
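For readers who want to reproduce a comparable score without dwdiff, here is a minimal Python sketch. It is not dwdiff and not the exact method used for the table: it assumes the score is the share of words from the original text that reappear, in order, in the OCR output, and it uses the standard library's difflib as a stand-in. The function name `word_recognition_rate` is made up for this example.

```python
import difflib


def word_recognition_rate(original: str, ocr_output: str) -> float:
    """Return the percentage of words of `original` found, in order, in `ocr_output`.

    Words are split on whitespace only, so punctuation stays attached
    to the word it belongs to.
    """
    original_words = original.split()
    ocr_words = ocr_output.split()
    if not original_words:
        return 0.0

    # autojunk=False keeps difflib from discarding frequent words in long texts.
    matcher = difflib.SequenceMatcher(None, original_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(original_words)


if __name__ == "__main__":
    original = "Shall be lifted - nevermore!"
    ocr = "Shall he lifted - nevermore!"
    print(f"{word_recognition_rate(original, ocr):.0f}%")
```

Because the comparison is word based, a single wrong character costs the whole word, which is one reason the scores drop so quickly for the harder fonts.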

As you can see, the commercial Abbyy software has absolutely no problems with the printed fonts, but fails at the handwriting. It is the slowest of all tested tools, but keep in mind that it also reads nearly any image format, while you probably need to convert your images for the other tools first.

If you prefer free OCR software, then tesseract is indeed as good as its reputation. Note that I used the most recent version here, built from SVN. Tesseract started out as proprietary software developed at Hewlett-Packard between 1985 and 1994; it was open sourced in 2005 and its development has been sponsored by Google since 2006. It is pretty picky about the input image's format, but once you get that right the results are decent enough.
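Getting the input format right can be scripted. The following Python sketch only builds the two command lines: the ImageMagick `convert` call with the options given in the table footnote, followed by a plain `tesseract <image> <output base>` call. The helper name `build_tesseract_commands` and the file names are assumptions for illustration, and nothing is executed until you pass the lists to `subprocess.run` yourself.

```python
from pathlib import Path


def build_tesseract_commands(png_path, work_dir="."):
    """Build the convert and tesseract command lines for one PNG image.

    Returns two argument lists suitable for subprocess.run().
    """
    png = Path(png_path)
    tif = Path(work_dir) / (png.stem + ".tif")
    out_base = Path(work_dir) / png.stem  # tesseract appends ".txt" itself

    # Same options as in the footnote: 8 bit depth, no alpha channel.
    convert_cmd = ["convert", str(png), "-depth", "8", "-alpha", "off", str(tif)]
    tesseract_cmd = ["tesseract", str(tif), str(out_base)]
    return convert_cmd, tesseract_cmd


if __name__ == "__main__":
    for cmd in build_tesseract_commands("courier-black.png"):
        print(" ".join(cmd))
```

To actually run the tools, something like `subprocess.run(cmd, check=True)` for each returned list should do, provided ImageMagick and tesseract are installed.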

The handwriting recognition worked best in gocr, which delivered only mediocre results for the other images. Of course, even its best handwriting result is still far from the original poetry.

I was surprised how far from perfect the results for these really simple images were. I initially intended to try some much more complicated images, but the results would have been unrecognizable then.
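To put a rough number on "far from perfect", the recognition rates from the results table can be averaged per tool. This is only a sketch of that arithmetic in Python: the values are copied from the table, and it assumes that the gray images without a cuneiform rate are simply left out of cuneiform's average instead of being counted as 0%.

```python
from statistics import mean

# Recognition rates in percent, copied from the results table.
# Order: courier, justy, times, verdana, each as black then gray.
# None marks the gray images for which no cuneiform rate is listed.
RATES = {
    "abbyyocr":  [100, 100, 11, 14, 100, 100, 100, 100],
    "cuneiform": [61, None, 3, None, 96, None, 95, None],
    "gocr":      [67, 67, 31, 31, 76, 76, 98, 98],
    "ocrad":     [21, 21, 1, 1, 82, 82, 98, 98],
    "tesseract": [81, 81, 15, 15, 92, 92, 98, 98],
}


def average_rates(rates):
    """Return the mean recognition rate per tool, skipping missing values."""
    return {
        tool: mean(value for value in values if value is not None)
        for tool, values in rates.items()
    }


if __name__ == "__main__":
    for tool, average in average_rates(RATES).items():
        print(f"{tool:10s} {average:6.2f}%")
```

With these assumptions the averages come out at about 78.1% for abbyyocr, 71.5% for tesseract, 68.0% for gocr, 63.8% for cuneiform and 50.5% for ocrad. Keep in mind that cuneiform's figure covers the black images only, so it is not directly comparable to the others.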

Tags: ocr, linux, software, comparison, abby, tesseract, gocr
1) supports many more formats
2) converted with: convert in.png -depth 8 -alpha off out.tif
Posted on Tuesday, June the 15th 2010 (20 months ago).

Comments?

1
Your little Flattr is down for this post. Wanting to flattr you...

Just saying.
2010-06-15 22:03:34
2
In my opinion, time is not so important here. I prefer waiting 100 times longer if it means spending half as much time on error correction. By the way, the table shows that, in general, the more time an OCR program spends on recognition, the better the result is. Thanks for the comparison, it's an interesting article.
2010-06-15 22:05:48
3
Jacroe, yeah no idea why. I asked the flattr support. Meanwhile you can flattr this article at https://flattr.com/thing/22821 … Comparison

Nikita, yes I agree. Time isn't too important here, except when you want to OCR hundreds of thousands of images...
2010-06-15 22:16:49
4
Hi Andi,

Thank you very much for this article. It is very enlightening and will help me a lot with a task I'm involved in.  :-)
2010-06-16 12:46:50
5
easy-ocr with 99.96% accuracy, download from here: http://code.google.com/p/easy-ocr/
2010-07-07 04:20:46
6
Great comparison. I'll give tesseract a try.
2010-08-27 07:35:18
7
Quite interesting.
I would use OCR to 'read' old mainframe-era line-printer outputs, and it would be interesting to know what is best at this very specific task.
Line printer output (at least in my case) is always uppercase ASCII, mixed with punctuation and numbers, representing source code and tabulated data.  Line printers were high-speed devices, and often characters are neither uniformly printed nor exactly aligned in rows (some slightly above and some below the ideal base line of the row).  The background is white paper, or easy-read paper with alternating white and light colored lines, mostly blue or green.
What do you think about such a specific task ?
Thanks in advance
2010-11-21 18:54:32
Gigi Piacentini
8
Hi Andreas,

nice comparison. I would like to do something similar, extended to easyocr and ocropus and to the current versions of the OCR libs you already tested.
Do you already have source code for this that you would be willing to share?
I would be glad to get a reply, or directly a mail with the code ;)
If you did it all by hand, please say so as well, then I'll know either way.

Many thanks and regards
Tobias
2010-11-30 11:08:18
Tobias
9
I want to extract text from an image of a PDF file
2010-12-01 06:25:21
Vince
10
Great information. Finding good OCR software is hard to do. I've had great luck using the free beta software offered by Ricoh Innovations. I highly recommend checking it out at: http://beta.rii.ricoh.com/beta … conversion
2011-01-14 00:45:38
Natalie
11
Nice review, I'll try tesseract. I just discovered this and have used gocr so far. The results are far better with a 600 dpi scan, that is.

With spellcheck and Replace, the correction is really quick.
2011-02-20 22:54:53
boby
12
The Ricoh tool is using abbyocr afaics.

Easy-OCR seems to be a pretty bundle for cuneiform, ocropus and tesseract, so I don't think it will give much better results.
2011-03-10 14:35:45
Douglas
13
Try again with tesseract 3

My PPA makes it easy to install on Maverick and Natty.

https://launchpad.net/~nutznboltz/+archive/tesseract
2011-05-15 19:20:51
nutznboltz
14
OCRFeeder has a graphical user interface and performs a Document Layout Analysis and transfers the layout to capable output formats. It uses either CuneiForm, GOCR, Ocrad or Tesseract as backend OCR engines. Scanners are accessed via SANE. Handles columns and figures but doesn't preserve fonts, paragraph styles, or tables. However you can manually set the typeface and size for output.
2011-06-04 15:41:59
enlightened
15
Hi Andreas,

If you can re-run your tests with OCRAD 0.21, gocr 0.50-pre (http://www.ovgu.de/jschulen/ocr/jocr.tgz), Cuneiform 1.1.0, Tesseract 3.0 – that would be absolutely great!

And the average of all marks will not hurt... to see the "winner".
2011-10-13 18:21:38
16
Have you ever tried a test using a fake language?  Or several languages mixed?  The results could give a clue whether and to what extent word lists or other techniques are used alongside the OCR itself.

Caspar
2011-11-21 11:00:24
Caspar
17
Automation via script and excellent results with tesseract (my favorite over cuneiform and gocr). However, I miss a way to create multi-column scans (multiple blocks) from the already scanned page (SimpleScan), that is, without having to scan again.
I'd appreciate a hint or a link.
Thanks, miroo
2011-12-31 09:56:47
miroo