

Xerox Scientists Apply Insights from Ethnography to Develop New Way to Categorize Documents

Press release from the issuing company

ROCHESTER, N.Y.--July 12, 2005-- Employing the same ethnographic methods used to observe the social order on a Polynesian atoll or document the culture of natives in southern Siberia, Xerox Corporation scientists have injected more human know-how into text mining, the practice of using computer analysis of documents to extract new information. The result is better categorization, with higher-quality, customized results.

In a paper titled "Work Practice in Research: A Case Study" being presented here today at the International Council on Systems Engineering (INCOSE) symposium, Nathaniel G. Martin, an ethnographer and computer scientist in the Xerox Innovation Group (XIG) in Webster, N.Y., described the new technology.

Categorization is a powerful form of text mining. It associates a document with subject categories that a computer learns from a "training set" of documents that a subject matter expert has classified by hand. The new software program improves the speed and accuracy of categorizing systems because it helps the subject matter expert interactively create the training set, choosing and refining the categories and the conditions under which they are applied. It's a technique that could improve results from traditional categorizing systems and is particularly useful for classifying short documents, according to Martin.

The scientists' discovery grew out of a request from a Xerox engineering group for help analyzing service logs, the records of calls from service technicians in the field to company engineers about problems with production printer and copier operation. The engineering group was manually classifying these logs so it could identify the most important problems and devote its efforts to solving them. The engineers asked XIG scientists to develop an algorithm that would automate the way service log problems were grouped into categories.

A traditional categorizing system would have learned from the work the engineers had already done, following the classification pattern already defined by the user. The categories would then remain static. However, when Martin and his colleagues used ethnographic techniques like conducting open-ended interviews and videotaping an engineer as he continued to categorize the service logs, they realized that what he was doing did not fit the traditional description of categorizing.

"Instead of performing a routine task of applying a predetermined label to each log in a highly constrained fashion, we saw that he was constructing additional categories as he read the logs," Martin said.

Working with the subject matter expert, the Xerox scientists developed a system that allows a subject matter expert to develop categories dynamically in a way a machine-learning system could not.

"The new system allows exploratory categorization that falls somewhere between categorization and clustering," Martin said. "It provides categories into which text data can be organized, but it allows the subject matter expert to change the categories as new ones are discovered."

This new technique reduced the time required to categorize the service logs from a week to a few minutes, making the group more productive. The new software program is now being used in other Xerox organizations to analyze unstructured responses such as comments from customers. Xerox has applied for a patent on the technology.

In addition, at the INCOSE symposium Anthony M. Federico, vice president, platform development for the Xerox Production Systems Group, will give a keynote speech tomorrow on "System Engineering in Advanced Color Imaging." Symposium attendees can also tour Xerox's Webster research and manufacturing complex to learn about the principles of color digital printing and how paper choice affects printing.
