Data Scrubbing: Clean Up Your Act

Cleanliness is next to godliness where data is concerned, but it can be difficult to convince your customers that theirs is less than squeaky clean. Annie Gurton gets data scrubbing.

Dealers who sell information systems running low-quality data will be blamed if the output is unusable, so you need to make sure your client's data is clean. One of the legendary sayings of the computer industry is "garbage in, garbage out", or Gigo. Information is only ever as good as the data it is based on, and if that is unreliable the whole information edifice will come crumbling down. If the data held within a system is inaccurate, duplicated or low-quality, then business decisions based on that data are going to be garbage too.

The IT manager who allows a fundamentally impure information system to be implemented and remain in place should, obviously, not be in the job. But it can be very difficult for IT managers who have been close to a system or a set of processes for a long time to see it clearly. They become used to a system which has quirks and shortcomings. Cue the system integrator or reseller, who is not only ideally placed to help an enterprise develop and implement routines which ensure cleaner data, but also brings the detached, logical approach which in-house personnel may find difficult to adopt.

James Evason, data warehouse business manager with tools vendor VMark Software, says integrators and resellers are the obvious group to evangelise about clean data. "Resellers and integrators see data cleansing as an essential part of the implementation process and ensure that it is included in the budget and on the project list of actions. An in-house IT manager is more inclined to take any suggestion that the data needs to be cleaned as a personal slight."

There are several reasons why data may be dirty or just not arranged in the right structure to deliver the required information. One is that a system has been implemented in haste and too little regard has been paid to the content and structure in the early stages of its development. Another is that impurities can creep in over time, or there may be errors at the point when data is keyed in.

Graham Auton, MIS director of the Rugby Group, which used Alternative Business Systems to implement a data warehousing solution, says: "The key to having good information coming out of your system is to ensure the structure and basic data are correct and reliable. Users are going to make critical decisions and the information they base those decisions on must be 100 per cent reliable. Once users lose confidence in the system it is very hard to get their confidence back."

Auton says that having incorrect data is worse than having no data at all. "It is too easy to be tempted not to spend time cleaning and structuring the data properly, yet systems which are rushed may turn out to be worthless," he adds.

Rajan Anketell, managing director of management consultancy Anketell Management Services, agrees that the causes can be hard to identify, but often come down to failure to integrate the information system within an IS strategy.

"Many IT people fail to acknowledge the more complex business issues and the increasing sophistication of users," says Anketell. "But in order to arrive at clean and useful data it is essential to understand the business thinking behind the information gathering and the way the data is to be used."

He adds that it is not just a matter of having clean or unclean data, but of having data in the format and structure which makes it most useful. "Data may be clean in that it is accurate, but too irrelevant, superficial or wrongly structured to make any sense or be of any use," he says.

Consultants and sales staff in integration and reseller businesses are increasingly the ones most able to take a "helicopter view" of their clients' business and make suggestions which will improve the data cleanliness and suitability for later mining and decision support.

Evason says: "A simple example is the way that dates are used. Dates may be collected on the day that stock comes in, when it is purchased by the customer, and then there may be a delivery date and perhaps a date when the transaction is invoiced. Yet the business will want to analyse that transaction either weekly, monthly, quarterly or annually. This discrepancy between the type of date used by different parts of the organisation can cause a big problem when it comes to making any sense of the data or using it for analysis and decision support."
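
The date problem Evason describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample dates and the rule that analysis is keyed on the purchase date are all hypothetical, not taken from any product mentioned here.

```python
from datetime import date


def reporting_periods(d: date) -> dict:
    """Derive weekly, monthly, quarterly and annual buckets for one date."""
    iso_year, iso_week, _ = d.isocalendar()
    return {
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": f"{d.year}-{d.month:02d}",
        "quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
        "year": str(d.year),
    }


# Hypothetical transaction carrying the four kinds of date described above.
transaction = {
    "stock_in": date(1997, 3, 27),
    "purchased": date(1997, 3, 31),
    "delivered": date(1997, 4, 2),
    "invoiced": date(1997, 4, 7),
}

# Assumed business rule: every department analyses on the purchase date.
ANALYSIS_DATE = "purchased"

by_purchase = reporting_periods(transaction[ANALYSIS_DATE])
by_invoice = reporting_periods(transaction["invoiced"])

print(by_purchase["quarter"], by_invoice["quarter"])
```

The same sale lands in a different month and quarter depending on which date is used, which is why the choice has to be agreed once and applied consistently before any analysis is attempted.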

He adds: "Problems arise because users want everything immediately without using proper implementation procedures, and they are inclined to see the hardware as the most important part of a system."

This view is reinforced by Mike Briercliffe, channel management consultant, who says: "It's true that too many users think first about the hardware platform and are just looking for a brand, and it's true that it is the channel resellers who increasingly have the experienced staff, able to ensure that data and information systems deliver what is required."

They do this, he says, by employing people who are more business analysts than IT salespeople, able to think about the customer's business and its processes like a management consultant. "Getting clean data once required calling in a programmer, but these days it first requires a sophisticated understanding of the business functions and strategies," he adds.

Evason agrees that the first step to clean and usable data is to have the right perspective on it and what is required from it. He says: "If you ask any IT manager if he has data quality problems, of course he'll say no. That is human nature. These are professional people, highly trained in the craft of structured analysis, design, coding and project management. They find it very difficult to admit that their data may be 'bad' but the truth is that most of them have problems, even if they don't want to realise it. Human nature and human error play a large part in compounding and sealing data problems so deeply into a system that they are extremely hard to spot and change."

Software now plays a large role in the data cleaning and restructuring processes, so that it is more likely to be a consultant than a programmer who is charged with the task of cleaning up a system.

One of the few times that IT directors are given carte blanche to do as they wish with a data scrubbing and consolidation project is when there is an acquisition or merger. Evason says: "There are legal time pressures to make sure that all the systems and data of two merging companies are consolidated, but generally speaking, the IT manager is told: 'We don't care how long it takes to do it, just get it right.' This is very rare. Usually the IT manager is trying to clean up data and improve the structure as he goes along, on the hoof."

Briercliffe agrees that when two companies join together there are often vast system problems, and the most common way to deal with the situation is to take one system and try to load the data from the other system into it. "This often results in chaos and a final system which is inadequate, with lots of wastage because elements of both systems are incompatible," he says.

Gary Smith at Red Brick Systems sees internal business reorganisations as opportunities to spot weaknesses in the information system. He says: "Internal reorganisations are a time to put the spotlight on the quality of data, but they are also a good time to initiate projects to clean up and scrub the existing system and data."

According to Smith, resellers and integrators who have been setting up a hardware system in an organisation should take the opportunity to turn the conversation to the application of the system and how it is going to be used in real life.

"Sales is all about looking for and creating your own opportunities," says Smith. "And making sure that a system is not going to be loaded with unclean or low-quality data is a natural opportunity."

Evason shares this view. "Integrators should be suspicious of the quality of data which an organisation proposes to use and build a process to attend to the problem and correct it. It is vitally important to do this, otherwise the integrator or reseller will be the target of accusations of selling a system which cannot deliver what it promised. Integrators have to ensure that the data quality is high, otherwise they will find themselves in the situation where they have to defend themselves."

Part of the process of system implementation should be ensuring that data is clean, he continues. Cleansing and transformation to an appropriate structure should be a routine part of setting up a data warehouse or information system. "People think of the process of converting data into information as simple extraction of facts from one source and putting it through a decision support tool or using information analysis techniques. But failure to plan for poor quality data invites disaster," he says.

The software tools on the market that facilitate cleansing and restructuring, or transforming data into usable quality, are increasingly easy to use and do not require a programmer or detailed fine-tooth-combing of data.

Evason continues: "We have a product called Datastage which allows the consultant who has little technical or programming knowledge to perform cleansing and transformation routines. Although the requirements have to be input into the software, once programmed it will pass data from various sources through various filters, resulting in output which is exactly as specified by the business rules that the software has been given.

"Data rejected from individual cleaning stages is collected in discrete files for summary and reappraisal. Sometimes it is rejected and sometimes it can be updated for reassimilation into the system."
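
The general idea of passing records through a series of filters and keeping each stage's rejects separate can be sketched as follows. This is not how Datastage works internally, which the article does not describe; the stage names, the sample records and the deliberately simplistic postcode rule are all assumptions made for illustration.

```python
from typing import Callable, Optional

Record = dict
# A stage returns the (possibly amended) record, or None to reject it.
Stage = Callable[[Record], Optional[Record]]


def require_customer_id(rec: Record) -> Optional[Record]:
    """Reject records with a missing or empty customer identifier."""
    return rec if rec.get("customer_id") else None


def normalise_postcode(rec: Record) -> Optional[Record]:
    """Upper-case and re-space the postcode; reject if implausibly short.

    Simplistic assumption: a valid postcode has at least five characters,
    the last three forming the inward code.
    """
    postcode = rec.get("postcode", "").replace(" ", "").upper()
    if len(postcode) < 5:
        return None
    return {**rec, "postcode": f"{postcode[:-3]} {postcode[-3:]}"}


def make_deduplicator() -> Stage:
    """Build a stage that rejects repeats of (customer_id, postcode)."""
    seen = set()

    def drop_duplicates(rec: Record) -> Optional[Record]:
        key = (rec["customer_id"], rec["postcode"])
        if key in seen:
            return None
        seen.add(key)
        return rec

    return drop_duplicates


def run_pipeline(records, stages):
    """Pass each record through every stage; collect rejects per stage."""
    clean = []
    rejects = {name: [] for name, _ in stages}
    for rec in records:
        current = rec
        for name, stage in stages:
            result = stage(current)
            if result is None:
                rejects[name].append(current)  # kept for later reappraisal
                break
            current = result
        else:
            clean.append(current)  # survived every stage
    return clean, rejects


stages = [
    ("missing_id", require_customer_id),
    ("bad_postcode", normalise_postcode),
    ("duplicate", make_deduplicator()),
]
records = [
    {"customer_id": "C001", "postcode": "cv21 2dt"},
    {"customer_id": "", "postcode": "CV21 2DT"},
    {"customer_id": "C001", "postcode": "CV212DT"},
    {"customer_id": "C002", "postcode": "n/a"},
]
clean, rejects = run_pipeline(records, stages)
print(len(clean), {name: len(recs) for name, recs in rejects.items()})
```

Keeping the rejects grouped by stage means each pile can be written to its own file and reviewed separately, so a record with a bad postcode can be corrected and fed back in without disturbing the rest.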

Transformation engines are a fairly new category of software, likely to change and improve the ways in which data can be used. Evason says: "Transformation software is a way of really adding value to data. It takes data supplied from several sources and transforms it into different ways of viewing it. It identifies and represents complex business relationships from scraps of data hidden in multiple systems, even buried in free-form text records."

The demand for clean data is set to grow. The trend is linked with the seemingly unstoppable upward curve of the sheer volume of data which most companies hold, together with increasing intolerance by users of systems which are not delivering exactly the right business information.

Extracting reliable business information by using software will soon be the only way that correct business decisions can be made, and ensuring that the data on which that information is based is clean is a task which resellers and integrators can adopt.

"Selling an enterprise the hardware and software to collect and manage information is only part of the challenge. Resellers which want to stay competitive have to provide services such as data scrubbing to ensure that the systems they are providing are really fulfilling their potential and promise," Briercliffe says.

With the number and frequency of mergers, acquisitions and internal business reorganisations on the increase, the ability to provide data cleansing services is a growth opportunity for integrators and resellers, and a good way to differentiate themselves from their competitors.