ROCHESTER, N.Y., May 16, 2006 – Have you ever had a hardcopy of a document but couldn’t find the electronic one? Or perhaps you want to be sure you can find the correct electronic version of a “final” draft you have in your hand?
Using sophisticated linguistic technology developed by researchers at Xerox Research Centre Europe, a new Xerox technology called TrueMatch can locate the original electronic file for you in seconds.
Helping to bridge the gap that exists between hardcopy and digital documents, TrueMatch is now out of the labs and being put to work in the office as an innovative feature of Xerox’s FreeFlow SMARTsend Pro 2.0 scanning software, announced today at AIIM/On Demand, a document imaging conference. SMARTsend Pro allows users to securely search and retrieve documents from Xerox DocuShare and Microsoft SharePoint repositories.
TrueMatch has advanced search capabilities that let knowledge workers easily find, in a repository, the electronic copy of a hardcopy document. People can simply run a paper document through a multifunction device and, upon command, TrueMatch will use its search-like interface to locate the original and multiple versions of the electronic file. TrueMatch can find the exact copy of a document, other versions or related materials.
This is especially useful and cost-effective for businesses that deal with information that is constantly being updated. For example, using this technology and a multifunction system, customers in shops or back offices can retrieve and print the most up-to-date sales price lists on-demand rather than printing centrally and distributing a large amount of material.
How TrueMatch Works
Once a hardcopy has been scanned and processed through Optical Character Recognition (OCR), TrueMatch analyzes the content of the document. TrueMatch first extracts the key elements – words or multiword expressions corresponding to possible topics of the document.
TrueMatch next ranks these key elements, taking into account several parameters including the number of times they appear in the input document and their average frequencies in the English language. Top-ranked elements are used to build and run queries. TrueMatch analyzes the documents returned through the queries and compares the results with the input document. To make this comparison, TrueMatch searches for the presence of the key elements in the retrieved documents. When all are present, it will look for finer information, such as the exact word order, to distinguish between a perfect match and a revision.
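The ranking and comparison steps described above can be illustrated with a minimal Python sketch. It is not Xerox's implementation: the scoring formula, the background frequency table and the function names (tokenize, rank_key_elements, classify) are all assumptions made for illustration, and it handles single words only, not multiword expressions.

```python
import math
import re
from collections import Counter

# Hypothetical background table: occurrences per million words of English.
# A real system would use a large reference corpus; these numbers are made up.
BACKGROUND_PER_MILLION = {
    "the": 50000.0, "of": 30000.0, "list": 150.0,
    "price": 120.0, "cartridge": 3.0, "toner": 2.0,
}
DEFAULT_PER_MILLION = 1.0  # assumed rate for words missing from the table


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_key_elements(text, top_n=3):
    """Rank words that are frequent in the document but rare in the language."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words)

    def score(word):
        doc_rate = counts[word] / total
        lang_rate = BACKGROUND_PER_MILLION.get(word, DEFAULT_PER_MILLION) / 1_000_000
        return counts[word] * math.log(doc_rate / lang_rate)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(counts, key=lambda w: (-score(w), w))[:top_n]


def classify(input_text, candidate_text, key_elements):
    """Check key elements first, then word order, as the text describes."""
    source, candidate = tokenize(input_text), tokenize(candidate_text)
    if not set(key_elements) <= set(candidate):
        return "no match"
    return "perfect match" if source == candidate else "revision"


scanned = "the toner cartridge price list the toner cartridge of the toner"
keys = rank_key_elements(scanned)
print(keys)
print(classify(scanned, "price of toner cartridge", keys))
```

In this sketch the top-ranked words would serve as the query terms, and classify is applied to each document the query returns. Common words such as "the" score low even when they appear often, because their frequency in the language is high.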
As a result, TrueMatch is fast and has a high success rate in identifying the electronic version of the input document – a perfect match – provided the input document is distinctive, as technical documents typically are.
TrueMatch can also match documents even when the input document is only a portion of the entire document needing retrieval. In this case, the longer the input document, the better the results. However, the technology is so “smart” that in almost all cases it is able to find the correct corresponding document even when a one-page fragment is used.
Because errors can be introduced during OCR, TrueMatch has to be smart and flexible enough to consider documents as candidates for perfect matches even if some key elements are not found in the target. To do this, TrueMatch works with a tolerance level during the matching stage. The tolerance level has a default value, but the administrator can tune it if needed. Poor-quality paper documents can lead to high OCR error levels. In such cases, errors can affect more than 5 percent of the characters, which means that as many as one word in three may be erroneous. At the same time, if only a portion of the paper input has deteriorated, the remaining part is often sufficient to identify the electronic version and return it to the user.
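The link between character errors and word errors, and the role of a tolerance level, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's logic: it assumes OCR errors hit characters independently, and the function names and the default tolerance of 0.3 are hypothetical.

```python
def word_error_probability(char_error_rate, word_length):
    """Chance that at least one character in a word is misread,
    assuming each character fails independently (a simplification)."""
    return 1.0 - (1.0 - char_error_rate) ** word_length


def is_candidate(key_elements, target_words, tolerance=0.3):
    """Accept a target if the share of missing key elements stays within
    the tolerance. The 0.3 default is a made-up value for illustration."""
    if not key_elements:
        return False
    missing = sum(1 for key in key_elements if key not in target_words)
    return missing / len(key_elements) <= tolerance


# With a 5 percent character error rate, longer words are hit more often.
for length in (5, 8):
    print(length, round(word_error_probability(0.05, length), 3))

keys = ["toner", "cartridge", "price", "yield", "warranty"]
print(is_candidate(keys, {"toner", "cartridge", "price", "yield"}))
```

Under the independence assumption, a 5 percent character error rate corrupts roughly 23 percent of five-letter words and roughly 34 percent of eight-letter words, which is in line with the "one word in three" figure above. A tolerance check of this kind lets a target remain a candidate even when OCR has damaged some of the key elements.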