Several Open Source OCR projects exist or have existed over the years; the most active one is probably Tesseract.
Optical character recognition (OCR) is not only about pure “character recognition”, but also about:
- supported image input formats and intelligent PDF processing
- adapted image pre-processing that separates the characters to be recognized from the background and straightens text lines
- layout analysis to detect what is text, what is a text column, and the reading order
- support for different alphabets, e.g. Latin, Cyrillic, Greek, Hebrew, Arabic …
- support for different print types, e.g. regular printed fonts, dot-matrix, typewriter …
- support for recognition languages, e.g. character sets, dictionaries …
- export formats and options, e.g. TXT, XML, PDF, Office formats, HTML, ePUB
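To make the pre-processing item above more concrete, here is a minimal sketch of one classic way to separate characters from the background: global thresholding with Otsu's method. It is an illustration only and not the algorithm used by ABBYY or by any particular open source engine. The function names `otsu_threshold` and `binarize` are hypothetical, and the sketch assumes an 8-bit grayscale page given as a list of pixel rows with dark text on a light background.

```python
from typing import List


def otsu_threshold(pixels: List[int]) -> int:
    """Return the gray level (0-255) that best splits pixels into two classes.

    Otsu's method picks the threshold that maximizes the variance
    between the "low" class (<= threshold) and the "high" class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels at or below the candidate threshold
    sum_low = 0  # sum of gray levels of those pixels
    for t in range(256):
        w_low += hist[t]
        if w_low == 0:
            continue  # no pixels in the low class yet
        w_high = total - w_low
        if w_high == 0:
            break  # all pixels are in the low class, nothing left to split
        sum_low += t * hist[t]
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        between_var = w_low * w_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map a grayscale image to 1 (ink) / 0 (background).

    Assumption: text is darker than the background.
    """
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


if __name__ == "__main__":
    # Tiny made-up "page": dark strokes (20-30) on light paper (200-220).
    page = [
        [220, 220, 30, 220],
        [220, 20, 30, 200],
        [210, 220, 220, 220],
    ]
    for row in binarize(page):
        print(row)
```

A single global threshold like this tends to fail on unevenly lit scans or colored backgrounds, which is one reason why pre-processing in production OCR is considerably more involved than this sketch suggests.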
ABBYY has been developing and improving the core technologies for all of the above-mentioned areas for over 20 years.
To bring OCR technology to a new level of speed and quality, a lot of scientific work and quality testing has to be done.
Ongoing testing is extremely important, because it is the only way to ensure that the complete package keeps getting better.
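As an illustration of what such quality testing can measure, the sketch below computes a character error rate: the edit distance between a manually verified reference text and the OCR output, divided by the length of the reference. This is a commonly used metric, not a description of ABBYY's internal test procedure, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(
                prev[j] + 1,         # delete ca
                cur[j - 1] + 1,      # insert cb
                prev[j - 1] + cost,  # substitute ca -> cb (or match)
            ))
        prev = cur
    return prev[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


if __name__ == "__main__":
    # Made-up example: the engine confused the letter "O" with the digit "0".
    reference = "Open Source OCR"
    ocr_output = "0pen Source 0CR"
    print(f"CER: {character_error_rate(reference, ocr_output):.3f}")
```

Running such a comparison over a fixed set of reference pages after every change to an engine gives a simple way to see whether overall recognition quality improves or regresses.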
ABBYY products and technologies are used worldwide and millions of pages are processed every day. To be able to deliver this generic approach for a universal OCR technology, a lot of scientific research in pattern recognition and linguistics, as well as other IT know-how, has to be built up.
In daily OCR production environments a very broad variety of documents (file types, document layouts, fonts, languages, etc.) has to be processed. The OCR result has to be as good as possible, almost always after only one processing run.
In this area, commercial OCR software is probably worth the investment, because the recognition result on most standard documents can be used right away. This statement is not directed against any of the open source projects, but as a matter of fact even the pre-compiled distributions are not able to fulfill this generic approach.
There are other scenarios where a specially tuned Open Source OCR engine can deliver better results than the “out of the box” ABBYY product. This can happen with certain images or document types that were not part of the core production process.
Here are some external resources in which multiple Open Source OCR engines were tested:
Blog article on splitbrain.org comparing abbyyocr, cuneiform, gocr, ocrad and tesseract:
Linux OCR Software Comparison (5.2010)
Linux Magazin 07/2010:
Die ABBYY-OCR-Engine für Linux im Test - Richtig gelesen? ("The ABBYY OCR engine for Linux put to the test - read correctly?", available in German only)