By Bob Miller

April 26, 2004 -- Last month I laid out the general problem of pulling together data for one-to-one marketing. This month I would like to go from a general description of the problem to a very specific solution: a particular small data project at The Rochester Group.

The customer was a new business offering a service to lawyers. They wanted to launch their new enterprise with a simple one-to-many mailing to individual lawyers. What they had to start with were a hardcopy list, each of three partners' individual contact lists, three member lists from professional associations, and seventeen lists of people who had attended various events over the last few years -- a total of twenty-four files with about ten thousand names and addresses. The good news was that they had plenty of data. The bad news was that, considering the sources, it was a pretty good bet that a significant number of the people on the lists were duplicate, disbarred, deceased, or undeliverable.

Our first step was to deal with the hardcopy list. Luckily it was printed in a fixed-width dot-matrix printer font, so it scanned reasonably well. Then we had to write a program that translated the address listing into a column format. One surprise stroke of luck was that some of the names in the list were in the form 'Mr. Clarence Darrow, Deceased.' We had already found some names to eliminate!

A one-off approach

Next, we had to get all the various files into the same column format--first name, last name, street address, and so on. This is a fairly simple programming effort, although it was complicated by the fact that the partners had been "creative" in using their contact managers, putting county names in city fields and the like. This is, unfortunately, an ad hoc effort. You just have to see what you've got and deal with it on a one-off basis. Luckily, once you understand the problem, the programming effort is minimal.
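The listing-to-columns translation might be sketched as follows. This is a minimal illustration, not the program we actually wrote; the block layout of the scanned listing and the field names are assumptions.

```python
# Sketch: translate a scanned address listing into column-format records,
# flagging names marked 'Deceased'. The input layout here is an assumption.
raw = """\
Mr. Clarence Darrow, Deceased
77 W. Washington St.
Chicago, IL 60602

Ms. Alissa McGregor
12 Main St.
Rochester, NY 14604
"""

def parse_blocks(text):
    records = []
    for block in text.strip().split("\n\n"):
        name, street, city_line = block.splitlines()[:3]
        deceased = name.rstrip().endswith("Deceased")
        name = name.replace(", Deceased", "").strip()
        # Split a "City, ST ZIP" line into its three parts.
        city, rest = city_line.split(",", 1)
        state, zip_code = rest.split()
        records.append({
            "name": name, "street": street, "city": city.strip(),
            "state": state, "zip": zip_code, "deceased": deceased,
        })
    return records

records = parse_blocks(raw)
# Records flagged deceased can be dropped before any further processing.
keep = [r for r in records if not r["deceased"]]
```

The real effort, as noted above, is in handling each file's quirks one-off; the parsing itself stays small once the layout is understood.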
For reasons that will become clear later, we also added a field telling us which of the original files each record came from.

Now that we had all the information lined up, we ran the addresses through CASS address-correction software. This picked off a few more addresses that were just plain undeliverable. More importantly, though, it normalized the addresses that remained; that is, we could compare addresses without worrying about abbreviations and common misspellings.

At this point we could find most of the duplicate records with a simple comparison program. By looking at a few name and address fields we could remove clear duplicates. So if Ms. Alissa McGregor was on two of the partners' contact lists and had attended one of the events, we reduced her to one record instead of three. This step removed about thirty-five hundred duplicate records.

Up to this point the computer had done most of the work. Now the list needed a little human attention. We wrote a program that would find possible but uncertain matches and display them on a screen; a person could then pick any or all of the records to remain in the list. For instance, Edward and Edwina Massie are probably two different people, but Ed and Edward are probably the same. If we found a single person with two addresses, then the source of the data became important. If one address came from a partner's contact list and the other was from an event the person attended last month, we would know to take the latter. In some cases we would just ask the customer. This step took several hours but removed another one thousand five hundred duplicates, and did it in a much more intelligent manner than any program could have done.

Movable line

It is worth noting that the line between what is done automatically and what is done by 'eyeball' is movable, based on the application. If you need high quality and are willing to pay for it, then you will do more inspection.
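The two de-duplication passes described above can be sketched roughly as below. The records, field names, and the similarity cutoff are all illustrative assumptions; the point is the shape of the process: exact matches removed automatically, near matches queued for a person to judge.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative records only -- the Massies from the example above, plus the
# 'source' field we added so we know where each record came from.
rows = [
    {"first": "Edward", "last": "Massie", "street": "9 Elm St", "source": "partner_a"},
    {"first": "Ed",     "last": "Massie", "street": "9 Elm St", "source": "event_2003"},
    {"first": "Edwina", "last": "Massie", "street": "9 Elm St", "source": "assoc_list"},
    {"first": "Edward", "last": "Massie", "street": "9 Elm St", "source": "partner_b"},
]

# Pass 1: exact match on normalized name + address removes clear duplicates.
seen, unique = set(), []
for r in rows:
    key = (r["first"].lower(), r["last"].lower(), r["street"].lower())
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Pass 2: flag possible but uncertain matches for a person to review.
# A prefix test catches nicknames (Ed/Edward); the 0.6 ratio is an
# assumed cutoff for near-identical spellings (Edward/Edwina).
def is_candidate(a, b):
    fa, fb = a["first"].lower(), b["first"].lower()
    close = fa.startswith(fb) or fb.startswith(fa) \
        or SequenceMatcher(None, fa, fb).ratio() >= 0.6
    return close and a["last"] == b["last"] and a["street"] == b["street"]

review = [(a, b) for a, b in combinations(unique, 2) if is_candidate(a, b)]
```

The duplicate Edward record is removed automatically; the Ed/Edward/Edwina pairs all land in the review queue, where a human (or the customer) makes the call.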
If we hadn't wanted to do any inspection, we could have made our de-duping program a little more ruthless, and Edward and Edwina would have gotten only one piece of mail between them. In a large data-cleanup effort of hundreds of thousands or millions of records, we will typically fold any rules that come out of the inspection process back into a program, leveraging the manual effort. For ten thousand records it just wasn't worth it.

At this point the data was 'clean enough' to do the mailing. An important final step was to repopulate the partners' contact lists with the new data. The customer also had to put processes in place to ensure that only these 'master records' were updated. We also periodically run their new customer list through the process to clean up any gradual problems. If this is not done, then of course all future marketing activities are equally difficult and expensive.

So that's it. We started with a list of ten thousand and found about five thousand records that were duplicate, deceased, or undeliverable. It took about one hundred hours of person time, much of it spent eyeballing the data. The inspection time could actually have been spent by the customer, had they chosen; this would certainly have decreased the cost and would probably have increased the quality of the decisions. The computer resources are trivial for a project of this size.

This is a specialized activity, and it's not really the sort of thing that most programmers aspire to. Because of the uniqueness of many data issues, you can't buy software to clean up your data for you--it really does require a more customized effort to get the best results. Hopefully I have armed you with enough information to help your customers get past this critical roadblock and send you more data to feed your digital printers.
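Folding inspection rules back into a program, as described above for large cleanups, might look something like this. The nickname table and record shapes are illustrative assumptions, not rules from the actual project.

```python
# Sketch: codify decisions learned during manual inspection (e.g. nickname
# equivalences) so later, larger runs can apply them automatically.
# The table entries below are assumptions for illustration.
NICKNAMES = {"ed": "edward", "bill": "william", "peggy": "margaret"}

def canonical_first(name):
    n = name.lower()
    return NICKNAMES.get(n, n)

def same_person(a, b):
    # Two records collapse only when the canonical name and address agree.
    return (canonical_first(a["first"]) == canonical_first(b["first"])
            and a["last"].lower() == b["last"].lower()
            and a["street"].lower() == b["street"].lower())

ed     = {"first": "Ed",     "last": "Massie", "street": "9 Elm St"}
edward = {"first": "Edward", "last": "Massie", "street": "9 Elm St"}
edwina = {"first": "Edwina", "last": "Massie", "street": "9 Elm St"}
```

Each rule added this way is one fewer screenful for the eyeball pass next time, which is exactly the leverage that makes the approach pay off at scale.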