OCR (Optical Character Recognition: software that converts scanned hard-copy documents into editable text for use in a word processor) is still an area where the open source world has a lot of catching up to do with commercially available applications (e.g. Nuance OmniPage). Recently there have been some interesting developments in open source OCR: HP has open sourced Tesseract, OCR code it developed between 1985 and 1994, which is now being further developed with help and funding from Google (Google releases open source OCR).
Austin Acton has written an interesting review in which he compares six free, open source packages: GOCR, ClaraOCR, Ocre, Ocrad, Tesseract and OCRopus. Of these, ClaraOCR, Ocrad and GOCR are in the FreeBSD ports.
The conclusion of the article is:
“The good news is that there are solutions available on Linux right now which interpret documents at up to 99% accuracy. The bad news is that 99% is not 100%, and that anything other than a high quality 400-600 DPI scan of 12-14 point font drops off very quickly in accuracy. The combination of Tesseract and Ocropus is clearly the project we can most rely on to provide the missing elements of a full-featured Free OCR suite.”
The review is an interesting read, and I’d recommend it if you’re interested in OCR and don’t want to pay for expensive proprietary software.