30 Sep 2008

5 Lessons Learned About OCR In EDM.

Hailed as the way to breathe life into our static paper documents, Optical Character Recognition (OCR) has made strides in recent decades, becoming a staple module in just about every software package that manages documents, from Nuance’s PaperPort to EMC’s Documentum.

OCR, itself, can mean various things. Wikipedia offers this definition:

… the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text (2008).

While many estimate that OCR engines can reach accuracy levels of 98 or 99 percent, it has been my experience that this is very difficult to achieve with most commercially available software suites aimed at small-to-medium businesses (SMBs). Many variables can affect the accuracy of the output, ranging from document condition to readability.

With so many variables in scanning paper-based documents, it is often not possible to achieve high accuracy on a small budget. Thus, OCR can be a challenge to implement in many SMBs.

When the rubber meets the road:

Typical applications of OCR revolve around digitizing documents: each page is stored as an image, along with usable metadata drawn from the information on the physical page itself. In essence, the computer reads the document and creates a library of searchable information.

This type of application lets an EDM solution build a database of text, contextually tied back to the original images as a layer of the document, or image, itself. Searching for usable information within and across documents becomes much easier. In other words, it gets you in the right neighborhood.

Extremely high accuracy rates are often not an issue in these applications, because the indexes can be combined with this database of textual information, dramatically increasing the findability of information.
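To make the "right neighborhood" idea concrete, here is a minimal sketch of a full-text index in Python. It is not how any particular EDM product works; the document IDs, sample text and function names are all made up for illustration. The point it shows is that a misread word in one place does not stop you from finding the document through its other words.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical OCR output; "Invoice" was misread as "lnvoice" in doc-002.
pages = {
    "doc-001": "Invoice for Acme Corp, customer 10482",
    "doc-002": "lnvoice for Acme Corp, second notice",
}
index = build_index(pages)

print(search(index, "Acme Corp"))  # both documents are still found
print(search(index, "invoice"))    # only doc-001; the misread copy is missed
```

A search on "invoice" alone misses the misread page, but a search on the customer name still lands on both documents, which is usually good enough for this kind of application.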

Where are the brakes on this thing?

Problems begin to occur when OCR is applied not to the body text of the scanned document, but to lift the index values themselves (e.g., customer name, number, etc.). Why is this so dangerous?

Combined with other technology and processes, OCR itself is a wonderful aid in seeking efficiency within the business. However, with no quality assurance or stop-loss measures in place, it is highly likely that a document will be misplaced because a character is off here or there. In essence, you now have a needle in a haystack.
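A bit of back-of-the-envelope arithmetic shows why a character "off here or there" matters so much for index values. The sketch below assumes character errors are independent of each other, which is a simplification, and uses a hypothetical 10-character customer number; the function name is mine.

```python
def field_accuracy(char_accuracy, field_length):
    """Chance that every character in a field is read correctly,
    assuming character errors are independent (a simplification)."""
    return char_accuracy ** field_length

for rate in (0.98, 0.99):
    whole_field = field_accuracy(rate, 10)
    print(f"{rate:.0%} per character -> {whole_field:.1%} per 10-character field")
```

Under those assumptions, 98 percent character accuracy leaves only about 82 percent of 10-character index values entirely correct, and 99 percent leaves about 90 percent. Roughly one document in five to ten would be filed under the wrong value.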

The advice I would offer is quite simple:

  1. Document preparation is key to ensuring efficient use of personnel time as well as achieving high accuracy levels.
  2. Quality assurance on key information is requisite if high levels of accuracy are required – especially in audit or regulatory scenarios.
  3. Know your threshold of pain and what you can accept; know your goals. (Do you need 100% accuracy?)
  4. The key to findability of information contained within documents is to enforce process.
  5. Create an accountability structure based around solving issues rather than blaming others. In high-demand environments, appointing a “scanning czar” is critical.
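As one illustration of the quality assurance step in item 2, here is a rough sketch of a stop-loss check on an OCR-lifted index value. It assumes you have a list of valid customer numbers to compare against; the `CUST-` format, the list and the function name are hypothetical. Values that do not match exactly are never auto-corrected, only routed to a person along with the closest known candidates.

```python
import difflib

def check_index_value(ocr_value, known_values):
    """Accept exact matches; flag everything else for human review."""
    cleaned = ocr_value.strip().upper()
    if cleaned in known_values:
        return ("accepted", cleaned)
    # Suggest the closest known values to the reviewer; do not auto-correct.
    suggestions = difflib.get_close_matches(cleaned, known_values, n=3, cutoff=0.8)
    return ("needs_review", suggestions)

# Hypothetical master list of valid customer numbers.
known = ["CUST-10482", "CUST-10483", "CUST-20991"]

print(check_index_value("CUST-10482", known))  # exact match, accepted
print(check_index_value("CUST-1O482", known))  # letter O read for zero, flagged
```

The design choice here is deliberate: the check slows down only the documents that fail it, so personnel time goes to the handful of questionable values rather than to re-keying everything.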

Did I miss something? Do you have another opinion or experience you would like to share? Comments, suggestions and respectful disagreements are always welcome.

Ken Stewart’s blog, ChangeForge.com, focuses on the collision between the constantly changing worlds of business and technology. Ken is also the Director of Technology at Kearns Business Solutions.
