Making Sense of Data: Data to Data Mining

What is business intelligence and, most importantly, why do we need it?

The volume of data and the way we store it have evolved drastically with changing business scenarios. Expanding businesses and increasing competition have led businesses to appreciate the information and insight that data can provide. Business intelligence tools give organizations the ability to extract information from the data they possess.

Surprisingly, business intelligence has been around for as long as businesses and data have. The concept remains the same; only the terminology and technologies have changed over the years.

Business intelligence encompasses data warehousing, reporting, data modeling, data mining, text analysis and predictive analysis.

Let us explore how our expectations of data have evolved over the years from a business perspective.

Take the example of XYZ, a multinational automobile manufacturer based in Tokyo, Japan. We will try to understand how the business scenarios changed as the company made the transition from manufacturing engines to manufacturing automobiles.

The company started off manufacturing engines and automobile parts in the 1960s. In those days, data was generated and recorded through transactions such as customer orders and shipments to customers.

Even then, managers could ask questions and get answers that helped them increase productivity: Which part takes the longest to manufacture? Which part gets ordered most frequently, and by which customers?

These questions were easy to answer because the data was recorded and collected in one database at one location, and the volume was small. It was possible to collect and analyze the data manually to get these answers.

Why data warehousing?

Over the years the company grew and started manufacturing complete automobiles rather than just engines. It set up manufacturing units in more than one city and state, and opened showrooms in many cities. Each manufacturing unit and showroom stored its transactional, day-to-day data in its own database, following its own standards and terminology to represent that data.

Answering a question such as “which car or two-wheeler sells the most, and in which state?” for senior management at headquarters was a cumbersome task. It required gathering data from all the showrooms and a great deal of juggling to arrive at an answer. Even then there were problems with duplicate data, inconsistent data and multiple data formats.

Management needed quick answers in order to formulate policies and make decisions that would improve the business.

This is where data warehousing came to the rescue.

What is data warehousing?

Data warehousing is the process of building and maintaining a data warehouse.

A data warehouse is an integrated database for the enterprise in which all data is stored centrally. Before being stored, the data is cleansed: unwanted data is removed, formats are made consistent and duplicates are eliminated. This helps maintain consistency throughout the organization. Searching for answers becomes easy because all the data is in one place and in the desired format.

Creating a data warehouse is like giving the organization a memory.

How to do data warehousing?

Data warehousing follows three major processes, collectively known as ETL:

  • E - Extraction. Extracting data from all the varied data sources, for example getting data from all the manufacturing units and placing it into one database.
  • T - Transformation. Applying transformations to the data, for example removing null values or duplicates and changing data formats to maintain consistency.
  • L - Loading. Loading the transformed data into the centralized database. This creates the data warehouse.
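
The three ETL steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a production pipeline: the two source lists, their field names and the in-memory list standing in for the warehouse are all hypothetical, and removing exact duplicate rows is a simplification (a real system would more likely match on a transaction identifier).

```python
# Hypothetical records from two manufacturing units, each using its own conventions.
unit_a = [
    {"model": "Sedan X", "state": "CA", "units_sold": "120"},
    {"model": "Sedan X", "state": "CA", "units_sold": "120"},  # duplicate row
    {"model": "Hatch Y", "state": "TX", "units_sold": None},   # missing value
]
unit_b = [
    {"Model": "sedan x", "State": "ca", "Sold": 80},
    {"Model": "Hatch Y", "State": "tx", "Sold": 45},
]


def extract():
    """Extraction: pull rows from every source into one common shape."""
    rows = [(r["model"], r["state"], r["units_sold"]) for r in unit_a]
    rows += [(r["Model"], r["State"], r["Sold"]) for r in unit_b]
    return rows


def transform(rows):
    """Transformation: drop nulls, make formats consistent, remove duplicates."""
    cleaned, seen = [], set()
    for model, state, sold in rows:
        if sold is None:
            continue  # remove rows with missing values
        record = (model.title(), state.upper(), int(sold))
        if record in seen:
            continue  # remove exact duplicates (simplifying assumption)
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records, warehouse):
    """Loading: append the cleaned records to the central store."""
    warehouse.extend(records)
    return warehouse


warehouse = load(transform(extract()), [])
for row in warehouse:
    print(row)
```

After the run, the hypothetical warehouse holds three consistent rows; the duplicate and the row with the missing value have been dropped, and "sedan x" / "ca" have been brought into the same format as the other unit's data.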

With the data warehouse in place, we have a repository of the company's historical data in one place.

Answering questions like “which vehicle registered the maximum sales, and where?” now becomes easy and far less time-consuming. We can create a multidimensional data model, for example a cube, which gives a 360-degree view of the data and lets us look at it from various angles. We can also generate reports easily and effectively.
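
To make the idea of looking at data "from various angles" concrete, here is a minimal sketch of rolling up a small fact table along different dimensions. The rows, the dimension names (model, state, year) and the helper `roll_up` are hypothetical; real cubes are normally built with dedicated OLAP or reporting tools rather than hand-written code.

```python
from collections import defaultdict

# Hypothetical sales facts: three dimensions (model, state, year) and one measure (units).
facts = [
    {"model": "Sedan X", "state": "CA", "year": 2020, "units": 120},
    {"model": "Sedan X", "state": "TX", "year": 2020, "units": 80},
    {"model": "Hatch Y", "state": "TX", "year": 2020, "units": 45},
    {"model": "Hatch Y", "state": "CA", "year": 2021, "units": 60},
    {"model": "Sedan X", "state": "CA", "year": 2021, "units": 150},
]


def roll_up(rows, dimensions, measure="units"):
    """Sum the measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


by_model = roll_up(facts, ["model"])                 # one angle: sales per model
by_model_state = roll_up(facts, ["model", "state"])  # another: per model and state

# "Which vehicle registered the maximum sales, and where?"
best = max(by_model_state, key=by_model_state.get)
print(best, by_model_state[best])
```

The same function answers a different question simply by changing the list of dimensions, which is the essence of the multidimensional view.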

Why data mining?

The company wishes to maintain its market supremacy, but the competition has grown enormously. It wants to know what the future market holds and how to stay a step ahead of its competitors. It also wants more dependable returns on its investments by mitigating the risks involved in predictions based on intuition.

Now the company wishes to launch a new promotional campaign and wants to know which promotional scheme will generate the maximum return on investment in terms of vehicles sold, and why. This will help the company plan for the estimated number of customers and cut costs.

To answer this question, we have to dig through an enormous amount of data to find patterns among the customers who have responded well to past schemes, and to identify which kinds of schemes they responded to.

To do this, we use data mining.

What is data mining?

Data mining, also known as knowledge discovery in data, is the process of finding hidden patterns, trends and alerts in data. Data mining is usually done on a sample of data taken from the data warehouse, so efficient data mining requires a data warehouse to be in place. Many statistical techniques are available to analyze the data and find patterns or correlations in it.

Performing data mining is like going through the memory provided by the data warehouse, identifying trends and patterns, and making predictions based on scientific techniques rather than intuition.

How to do data mining?

There are three major phases involved in data mining:

  • Sampling of data - the data present in the data warehouse is explored to collect the important variables and parameters that appropriately represent the larger set of data.
  • Building models or hypotheses - using this subset of data, patterns are identified and hypotheses are proposed. This can be done by applying any of the available statistical methods.
  • Validation - the identified patterns or hypotheses are validated by applying the model to a new subset of data. After validation, predictions are made.
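
The three phases above can be illustrated with the promotional campaign question. The sketch below is deliberately simple and rests on assumptions: the campaign history, the scheme names, the 70/30 split and the fixed random seed are all hypothetical, and the "model" is nothing more than the response rate per scheme.

```python
import random

# Hypothetical campaign history: (promotional scheme, did the customer buy?)
history = (
    [("cashback", True)] * 8 + [("cashback", False)] * 2
    + [("free_service", True)] * 2 + [("free_service", False)] * 8
)


def response_rates(rows):
    """Share of customers who bought, per scheme present in the rows."""
    counts = {}
    for scheme, bought in rows:
        offered, purchased = counts.get(scheme, (0, 0))
        counts[scheme] = (offered + 1, purchased + int(bought))
    return {s: purchased / offered for s, (offered, purchased) in counts.items()}


# Phase 1, sampling: shuffle and set aside 30% of the records for validation.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
shuffled = history[:]
rng.shuffle(shuffled)
split = len(shuffled) * 7 // 10
sample, holdout = shuffled[:split], shuffled[split:]

# Phase 2, model or hypothesis: "the scheme with the best response rate in the sample wins".
sample_rates = response_rates(sample)
best_scheme = max(sample_rates, key=sample_rates.get)

# Phase 3, validation: check the hypothesis against records the model has not seen.
holdout_rates = response_rates(holdout)
validated_rate = holdout_rates.get(best_scheme)  # None if the scheme is absent from the holdout

print(best_scheme, sample_rates[best_scheme], validated_rate)
```

If the rate observed on the holdout records is close to the rate in the sample, the hypothesis has survived validation and can be used for prediction; if not, it has to be revised.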

Various tasks can be performed to identify patterns and trends and to make predictions. A few of them are:

  • Classification - arranging data into previously identified groups. Example: classifying t-shirt sizes as small, medium or large.
  • Clustering - arranging data into groups that are not predefined. Example: grouping t-shirts according to body types that emerge from the data, such as broad-shouldered or narrow-shouldered.
  • Finding association rules - identifying rules that link data together. Example: people who buy dog food may also buy a dog bone.
  • Regression - defining functions that best predict values for numerical data. Example: predicting the estimated value of a house after considering all the relevant factors.
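
As one worked example from the list above, an association rule such as "dog food → dog bone" is commonly judged by two measures: support (how often the items appear together) and confidence (how often the second item appears when the first one does). The baskets below are made up for illustration, and the helper names `support` and `confidence` are our own.

```python
# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"dog food", "dog bone"},
    {"dog food", "dog bone", "leash"},
    {"dog food"},
    {"cat food"},
    {"dog bone"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"dog food", "dog bone"}, baskets)
rule_confidence = confidence({"dog food"}, {"dog bone"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here two of the five baskets contain both items, and two of the three dog food baskets also contain a dog bone, so the rule holds for about two thirds of dog food buyers in this toy data.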

Data warehousing vs. Data mining

Data warehousing integrates and stores the organization's data, while data mining is performed on a sample drawn from that data. A data warehouse is therefore required to do data mining.

In a nutshell

Business intelligence helps businesses make intelligent, fact-based decisions.

Data warehousing is the first step towards integrating and consolidating the huge amount of data that an organization has. It provides the organization with a memory that supports decision making.

Data mining needs integrated and cleansed data in order to find patterns, form hypotheses and make predictions. That data is provided by a data warehouse.

By undertaking business intelligence activities, organizations can stay ahead of their competitors by mitigating the risks involved in decisions based on intuition.