BI reflections

Just another weblog

Overall subject reflection February 11, 2010

Filed under: overall subject reflection — Lee Shi Huan @ 1:47 pm

To sum up, BI has introduced me to a new perspective on data and analysis. I have learnt that BI is not solely a technology subject but also a business subject. This is so as, through the software, many important analyses can be deduced, which is important for a business to survive or revive. Also, implementing the system blindly will not work out; there is a need to learn and understand the business itself first. Thus, this subject is actually an all-rounded one. It has got me thinking and is very interesting.


Implementing BI

Filed under: All topics, Implementing Enterprise BI system — Lee Shi Huan @ 1:41 pm


by Greg Mancuso and Al Moreno


Today, enterprises are moving forward to integrate business intelligence (BI) and business performance management (BPM) applications into their infrastructure. There are a variety of reasons for this change, ranging from gaining competitive advantage to pressures exerted by regulatory compliance with the US Sarbanes-Oxley Act (SOX), the Basel II Capital Accord, and the Health Insurance Portability and Accountability Act (HIPAA). Whatever the reasons driving your organization to implement these solutions, it’s important to remember that deploying these technologies involves far more effort than drag, drop, and go.

While the benefits of these new technologies far exceed the negatives, organizations must be aware of the substantial effort required in the successful deployment of these tools and of the fact that these applications may place stringent requirements on a company’s infrastructure. Such pressures can cause great pain to the IT organization as well as to the business side of the house. Coping successfully with this pain requires a tightly coupled partnership between IT groups and the business community. Both sides must recognize the limitations and boundaries within which they must operate.

Many enterprises today contain sophisticated levels of specialty applications designed to provide members of organizations with strategic pieces of information that allow quick reaction to industry activity and change. They afford groups a highly focused view of key business components and vital information by filtering out less pertinent information. When combined with integrated and refined planning techniques as well as revamped business processes, these tools allow for near-real-time monitoring of almost every aspect of an organization.

To date, the struggle has been to grow the ancillary systems and tools so that organizations can better integrate the various sources of information within their data centers and in users’ heads. This allows for an increased level of detail and a more complete picture of the business situation at any given time. This picture can be monitored and compared with the company’s budgeting, planning, and forecasting processes, as well as its strategic goal setting. This ability to plan and monitor began as a primitive, highly manual, and one-cycle pass process. Over time and through a massive organizational effort, it has evolved into a much more iterative and interactive set of activities. This evolution has mandated a new, highly evolved set of business and technical processes that are intertwined with and complementary to one another to provide instantaneous access to the corporate world of “information.”

This report addresses this maturation process, from its evolutionary beginnings to what we foresee as future trends. Figure 1 depicts the typical key components and environment that results from the evolutionary process of an IT organization. Users become dependent on technologies and draw information from a variety of systems into their own workstations. Here, through either the use of Excel or Access or other similar technologies, users find ways to merge the data they need to produce the complex reports of today’s organizations.

Figure 1 — Corporate data silos.


Need for Organizational Infrastructure

Previously, organizations could work with disparate data at a leisurely pace and had the luxury of time to transform data into usable information. The dawn of e-commerce, overnight mergers and acquisitions, and shortened regulatory reporting periods no longer affords this luxury. Companies must have infrastructure in place that permits instantaneous access to highly refined pieces of data. Further, these infrastructures must have processes in place so that data can be automatically processed with sophisticated business rules and technologies, with minimal human intervention, into reliable and usable information.

It is not difficult to understand how the world of IT has undergone massive revamping over the past 10 to 15 years. As systems have matured in their capacity to process large volumes of data, so too has organizations’ dependency on automated analysis. Unfortunately, growth in tool capabilities has occurred much more rapidly than most organizations can manage. As a rule, IT organizations are slow to adopt new technologies and must enforce their existing change management processes to ensure stability of the data infrastructure and validity of their information. As a result, the typical business previously had a loose infrastructure that existed to support what was in place and that expanded as necessary to accommodate the new pieces of technology required to keep the latest operational systems running. As long as an enterprise could survive with the information and level of integration these makeshift infrastructures provided, organizations could function. While things seemed more the product of a kludge than a concerted effort, as long as minimally acceptable levels of data flowed forth, organizations could live with what they had.

When the world of business entered the electronic age, these architecturally unsound infrastructures could no longer meet organizations’ needs; organizations strove for competitive advantage using transformed data. IT was faced with the harsh realization that it had to maintain what it had put in place while simultaneously meeting the organization’s increased demand for more detailed, timely, and sophisticated levels of integrated information. Many organizations awoke to the reality that their infrastructures couldn’t support the new technological demands and that they were in the dire situation of having to increase efficiency and productivity without destroying the old operational systems — in many cases, with inadequate funds, reduced personnel levels, and a demand for new and unfamiliar skill sets. Many organizations couldn’t support the new requirements and were forced to start from square one while struggling with new demands, increased governmental business regulatory requirements, and new sophisticated BI/BPM tool suites that imposed rigid data requirements unsupportable by current infrastructures.

Business Versus IT Needs

Under such conditions, the business-IT relationship has grown increasingly at odds. In part, this tension stems from fundamentally different perspectives. IT is concerned with the installation, maintenance, and hardware and software implications of new demands. The business perspective, on the other hand, views IT as data keepers who simply need to plug in the new applications and turn on the faucet so that the information flows. IT must be responsive enough to provide the required data and services to business users. Meanwhile, the business community must be patient enough to give its technology support staff the time to build the requisite infrastructure to support the new tools in an orderly manner. If either side exerts pressure to rush the deployment or the data, it can lead to a haphazard installation that satisfies neither the operational needs of business users nor the technical needs of IT.

IT organizations have also contended with massive changes to infrastructures at a time when an unfavorable business climate has curtailed funds and staffing levels. The struggle to bridge the chasm between ramping up the delivery of information and maintaining required daily operations has resulted in a full-time juggling act for many within IT management.

Unable to cope with the perceived inactivity and lack of responsiveness of IT, many users have turned to self-deployed departmental solutions, which typically solve only short-term tactical issues and immediate business needs. Unfortunately, these departmental solutions themselves often become invaluable assets to departments. Business departments will adapt and survive, often by increasing their reliance on the tools that make their lives easier and job responsibilities attainable. This also leads to various departmental applications growing to the point at which they become long-term maintenance headaches that exceed a department’s capabilities. Eventually, these applications get moved under the responsibility of IT departments. This places further strain on IT, because departmental solutions can blossom into enterprise-wide critical decision-making applications that impose additional needs and requirements. The end result is a cycle with no apparent end: users ask IT for more; IT cannot respond on a timely basis; departmental users look for interim solutions; solutions become long-term applications and further weaken IT’s ability to implement long-range enterprise solutions.

Aside from the obvious strain on organizational resources, following this line of IT application growth creates high costs for any organization. Often, different departments select different toolsets that have overlapping functionality, creating a nightmare not only from a cost-of-initial-acquisition perspective but also from a maintenance perspective, since many of these solutions require hardware, software, and maintenance skills not readily available in the organization.

Proactive IT

Putting an end to this cycle requires that an organization make some extremely difficult choices. Organizations must step back and take stock. The cost of doing nothing eventually forces a decision, so it’s better to designate a time for extensive self-examination, which necessitates significant effort and commitment. For organizations that have successfully moved past the described situation, the common thread is a willingness to take a hard look at infrastructure and to take the time to do some planning. Many organizations have drawn on established frameworks, such as the Corporate Information Factory by Bill Inmon, Claudia Imhoff, and Ryan Sousa. As Figure 2 depicts, at the heart of such a strategy it is commonly advocated that the structure not only be built on existing organizational needs but also address the inevitable events that take place in any large operational environment.

Figure 2 — An information interchange strategy.

The partnership between IT and business is created by both sides sitting down to voice their concerns, issues, and responsibilities so that both groups can work successfully toward a common goal: creating applications that benefit the enterprise without overrunning budgets and irreparably damaging valuable relationships. By working together, both sides can accomplish their objectives, and the organization can benefit over the long term by having an infrastructure in place that not only services current needs but facilitates the assimilation of future growth into the enterprise. As the organization grows by either expansion or acquisition, it can add new applications and solve business problems with relative ease and in a short time frame. New applications don’t have to be a source of constant panic and upheaval.

Foresight and Vision

Today, an organization’s every IT move must be part of a master plan to support the organization. IT shops must have a firm handle on today’s needs, a strong understanding of the business, and an uncanny ability to foresee the future. While it was previously acceptable to build infrastructures that supported only immediate needs, current needs may extend as far as three to five years out, and failing to take a longer-term view will probably create more problems than up-front planning costs in time. It is safe to assume that even if you don’t have a current need for a given functionality, you will when another acquisition or merger forces the need. While it isn’t feasible to identify and design for every possible scenario, today’s IT organizational planners must have a vision and purpose that provides for easy incorporation of future technologies into the existing infrastructure.

Ten or 20 years ago, few businesses foresaw the onslaught of e-business and the need to satisfy a variety of internal and external user requirements. Organizations today provide valuable Internet and intranet functions that only a few years ago would not have been considered a business requirement. Those organizations that have suffered least in recent years had vision, exercised planning, and thought outside the box. Further, organizations often mistake planning for doing. Understanding that IT infrastructure must accommodate future growth doesn’t translate into buying every piece of hardware or software to support those needs right now. A successful infrastructure can absorb requirements and incorporate the latest technologies because these events have been planned for. The advances in storage, memory, and technology in general facilitate this planning. It has become virtually acceptable for vendors to develop hardware with a built-in obsolescence factor. Newer and better equipment is now being produced on an annual basis, with platforms becoming obsolete in two or three years. Organizations that build their infrastructure without accounting for such rapid change will find their ability to stay current severely hampered. Maintaining a competitive edge necessitates that you understand what your competitors are doing and plan for similar functionality.

Building the Organizational Infrastructure

Clearly then, the task of creating the best environment to support the new wave of technologies must include a good deal of thought and planning. As business mandates new requirements, IT can respond in a timely fashion and avoid the perception of being unresponsive. IT managers today must plan in four key infrastructural areas. First and foremost, any infrastructure must be able to provide operational support for the existing day-to-day systems that drive the business. These systems include accounting, customer relationship management (CRM), enterprise resource planning (ERP), payroll, and HR systems, to name a few. Today, a variety of packages are written and supported for the most popular relational databases and function on common hardware platforms. IT support for these systems is now fairly commonplace and involves maintenance of the systems, backup-and-recovery processing, and general manpower to support the functional side of the business. Many IT shops do not understand that today’s analysis tools draw from multiple systems of record and need tightly coupled and integrated information that has been highly cleansed and verified.

The second area of support and services involves providing data integration and allows for accumulation of large amounts of data from various systems. This environment typically contains data warehouses, data marts, and operational data stores. These repositories of data must be highly optimized for end-user queries and use, must be verifiable, and must permit the total audit of information and transformation back to the systems of record. This data can be tailored and customized, which then serves as the basis for the next phases of use. It is here that the individual data elements are married, matched, and altered from a set of discrete, disparate data to a coordinated and potent source of knowledge. Data maintained in this environment has been cleansed and matched in the transformation from disparate data into integrated information.
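The audit requirement described above — every integrated record must be traceable back to its systems of record — can be sketched as a transformation that carries lineage with each cleansed record. This is a minimal illustration, not a production ETL design; the system names (ORDERS, CRM) and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IntegratedRecord:
    """A cleansed record that retains an audit trail back to its sources."""
    customer_id: str
    name: str
    lineage: list = field(default_factory=list)  # (system, source_key, loaded_at)

def integrate(orders_row, crm_row):
    # Cleanse: normalize the customer name before it enters the warehouse.
    name = crm_row["name"].strip().title()
    rec = IntegratedRecord(customer_id=orders_row["cust_id"], name=name)
    # Record which system of record each element came from, so the
    # transformation can be audited back to its origin.
    loaded_at = datetime.now(timezone.utc).isoformat()
    rec.lineage.append(("ORDERS", orders_row["cust_id"], loaded_at))
    rec.lineage.append(("CRM", crm_row["crm_id"], loaded_at))
    return rec

rec = integrate({"cust_id": "C-100"}, {"crm_id": 7, "name": "  acme corp "})
print(rec.name)                         # Acme Corp
print([s for s, _, _ in rec.lineage])   # ['ORDERS', 'CRM']
```

The key design point is that cleansing and lineage capture happen in the same step, so the repository never holds transformed data that cannot be audited back to a source system.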

Third, the area of corporate reporting may also include a separate or complementary set of tools that permits the analysis of historical information, which can become a competitive knowledge base from which management can draw solid and verifiable information. In turn, this information forms the basis of management’s strategic action plans. Because the data contained in the repositories is solid and verifiable, it is only logical that decisions based on this knowledge, when coupled with executive management’s expertise and understanding of the business, yield valid and attainable strategic objectives that provide an enterprise with a strong competitive advantage.

Finally, the organization requires a means of disseminating information from the top down as well as from the bottom up. IT must have a firm handle on the corporate communications infrastructure, since requirements in this area may involve everything from intranet reporting to providing customers with transactional information about their purchases, complete with price analysis and historical trends. The communications infrastructure must be robust enough to handle volumes of Internet traffic, large amounts of electronic data movement, and a means of information exchange between systems and corporate entities. In general, the network can be your best friend or your worst enemy, and its role is dependent on bandwidth and the ability to expand with the growth of the enterprise. In short, this area, too, requires good planning and the ability to foresee future needs within the organization. The emphasis should be on planning for expansion, purchasing, and implementing only what is required for today while maintaining the ability to incorporate future needs into the existing infrastructure easily and with a minimum amount of pain.

BI/BPM Application Mandates

You might wonder why it is necessary to place such heavy emphasis on a strong, stable, and flexible infrastructure when moving into the BI and BPM arena. Companies that implement these applications gain significant competitive advantage, and these toolsets provide a consistent ability to dramatically improve the accuracy of executive management’s objectives. These technologies thrive on massive volumes of information flowing from operational systems, honed by BI/BPM applications, to provide near-real-time analysis and monitoring of conditions as compared with stated goals and objectives.

Traditionally, organizations set their direction through a highly time-consuming and manpower-intensive process. The process further required that strategic direction be set annually, with feedback provided on some fixed basis, usually monthly. With the advent of today’s BI/BPM tools, it’s now possible to provide massive amounts of consolidated, cleansed, and coordinated information on an ongoing and continuous basis. This information is more timely than before and is sufficiently easy to use that even the most computer-illiterate manager can access it in a format he or she can easily understand and manipulate. This ability allows constant oversight of the strategic direction while at the same time permitting and empowering management to make midcourse corrections with confidence. The tools’ capability to provide detailed information from multiple points of view allows decision makers to readily make the changes required to drive new business opportunities and to retune every level of the organization.

In addition, these new toolsets and applications allow both strategic and tactical information to be directed to the organizational levels where this information can do the most good. Using Internet and communications infrastructure, field personnel can receive daily results of these efforts, permitting them to identify issues and areas of concern long before problems arise. All such capabilities, of course, are contingent on the ability of the infrastructure to support organizational needs to present the information in a manner and at a pace that serves the needs of the entire organization.

Corporate Data Structures

When trying to understand the totality of corporate data structures, one must account for the fact that several layers of data flow through an organization. Three key components of corporate data structures must be accounted for. The first is the operational data structure. This consists of structured data environments and unstructured data environments. Figure 3 illustrates the key components of these two worlds of information.

Figure 3 — Corporate data structures.

This layer of corporate data comprises the world of hard-and-fast facts, figures, and dollar amounts. To date, organizations have spent almost 100% of their resources tending to the structured portion of the enterprise. Operational systems have been honed and carefully orchestrated to deliver high-capacity storage and retrieval of this key corporate data. Customer names, account histories, and historical financial dealings have all been captured, categorized, and optimized for analysis.

Oddly enough, the new regulatory requirements of SOX and Basel II have initiated a new trend in the business world. SOX and Basel II now require much closer monitoring of unstructured information consisting of paper documents, e-mail messages, and corporate communication. Organizations now recognize that there are few software options to deal with the need to merge the structured and unstructured worlds of data. Organizations capture orders and financial dealings, but few have ever contended with the volume of customer interactions that now occur electronically through e-mail or postal mail; nor is the massive amount of paper generated daily collected or captured. The information contained in letters, e-mail messages, and other corporate communications is now considered as important as the information contained in the operational world.

Existing back-office systems capture some information, but few, if any, tools are available to categorize and merge the information captured with structured information, resulting in the loss of significant information that could clarify and improve decision making. For example, the reason some customer behaviors don’t conform to corporate expectations may well lie buried in these volumes of untapped data. This is evident if you consider the volumes of unstructured information: industry experts such as Inmon put the ratio of structured to unstructured data at roughly 20%-80%. By extension, 100% of companies’ decisions and goal setting is based on only 20% of the available information (a variation on Pareto’s Principle).

The fact that organizations make decisions using only small amounts of structured information is further complicated by the lack of analytical tools to support the use of unstructured information. Most analytical tools today address structured information. The current crop of analytical tools uses facts, figures, and digitized information to look for trends, report on collected data, and provide BI/BPM applications that dissect and manipulate the available information. While previously these capabilities sufficed, the trend is a move to grab, categorize, and make use of unstructured data and, equally important, to use the unstructured data in the context of the current structured analysis.

This seems to be the wave of the future as enterprises struggle to move forward in search of competitive advantage. Spinning spreadsheets will no longer provide the benefits required to maintain industry leadership. Corporate competitive advantage is now largely contained in the unstructured information of e-mail and hard-copy communication. The difference between corporate excellence and failure has diminished so significantly that organizations that make the transition to incorporate this unstructured information will gain the definitive competitive advantage.

The final layer of corporate data, metadata, is often totally overlooked or underdefined. Metadata means different things to different members of an organization. In the traditional sense, metadata is informational data about the data that resides in your corporate systems. There are also multiple types of metadata. Technical metadata describes the data elements that make up your IT systems: the name of each data element, its length, its type, and so forth. For example, the repository might record that the data element CALC_PCT is a calculated field stored as a real (or float, or floating point) type, along with the database table names in which the element is stored.

By contrast, there is also business metadata. Business metadata stores the reference information that makes sense to end users. This data may include the common name for a given element, such as margin percent, the reports it is used on, the system it is sourced from, and an English definition of the calculation formula. The audience for this data includes those looking at a saved calculation and wanting to know the underlying data elements used to create the cell in question.
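The two kinds of metadata described above might sit side by side in a repository roughly as follows. This is a minimal sketch; the field names and the sample report name are illustrative assumptions, not a reference schema.

```python
from dataclasses import dataclass

@dataclass
class TechnicalMetadata:
    """What IT knows: the physical element as it lives in the database."""
    element_name: str   # physical column name, e.g. CALC_PCT
    data_type: str
    length: int
    table_name: str

@dataclass
class BusinessMetadata:
    """What end users know: the same element in business terms."""
    element_name: str     # links back to the technical entry
    common_name: str      # the name users recognize on reports
    source_system: str
    used_on_reports: list
    definition: str       # plain-English formula

tech = TechnicalMetadata("CALC_PCT", "float", 8, "SALES_FACT")
biz = BusinessMetadata(
    element_name="CALC_PCT",
    common_name="margin percent",
    source_system="G/L",
    used_on_reports=["Monthly P&L"],  # hypothetical report name
    definition="(revenue - cost) / revenue, expressed as a percent",
)

# A user looking at a saved calculation can trace the common name
# back to the physical element, its table, and its source system.
assert biz.element_name == tech.element_name
```

The shared `element_name` key is what lets a business user start from "margin percent" on a report and arrive at the cryptic column and table the IT staff maintain.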

Most enterprises either totally ignore the business metadata requirement or populate only the most basic information. Previously, this was acceptable, but as BI/BPM toolsets have taken hold, the business model and users have changed; users no longer look solely at report output, because not only IT programmers but also business users work with these tools. Suddenly it has become imperative for organizations to understand their stored data from a technical perspective as well as an end-user perspective. This information must involve more than the cryptic database column names understood by IT staff, because regulatory mandates force end users to be aware of the information they use and assign blame to those creating informational outlets for the company. In addition, users have added a new layer of “metadata” in the form of reporting structures.

Reporting is more than just formatting columns and rows. Specifically, it is a method of visually displaying the progress of the company for everyone to see. All reporting involves creating report levels, or hierarchies. Hierarchy management is a way of ensuring that all members of an organization use a consistent method to create reports. Hierarchy management now plays a vital role because SOX mandates a clear, concise, and repeatable reporting format. Hierarchies are logical rollups that can be disseminated to all the systems used for reporting and data analysis. By ensuring that hierarchies are maintained and managed in a central repository, it is safe to assume that everyone in the organization shares a consistent view of the business — that is, of course, if everyone uses the same reporting structures. Many new products are emerging to perform this vital and often-overlooked functionality. Figure 4 depicts the relationships related to reporting hierarchies.

Figure 4 — Organizational hierarchy management.
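The central-repository idea behind hierarchy management can be sketched as a single hierarchy definition that every reporting system consumes, so that all rollups stay consistent. The store and region names below are hypothetical.

```python
# One centrally managed reporting hierarchy, stored as child -> parent.
# Every reporting and analysis system reads from this single source.
HIERARCHY = {
    "Store 12": "Northeast Region",
    "Store 31": "Northeast Region",
    "Northeast Region": "North America",
    "North America": "Total Company",
}

def rollup_path(member):
    """Walk a member up the shared hierarchy to the top-level rollup."""
    path = [member]
    while path[-1] in HIERARCHY:
        path.append(HIERARCHY[path[-1]])
    return path

# Every report that uses the repository rolls "Store 12" up the same way,
# giving the whole organization a consistent view of the business.
print(rollup_path("Store 12"))
# ['Store 12', 'Northeast Region', 'North America', 'Total Company']
```

Because the rollup logic reads one shared structure rather than per-report definitions, a change to the hierarchy (say, a regional reorganization) propagates to every report automatically, which is exactly the repeatable, auditable reporting format that SOX-style mandates call for.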


Business information needs vary greatly from user to user. Many corporate technology infrastructure users reside mainly in the tactical space; that is, they are business users whose sole responsibility is performing data entry for the company’s various operational systems, such as order entry, G/L, ERP, and CRM applications. These users are often supported with targeted report output from these same applications, satisfying the various QA activities associated with their job responsibilities. Questions such as “Did I enter that order correctly?” and “What are the phone numbers for this customer group so that I can call and verify their information?” are common. As one moves up the corporate hierarchy, the need for more strategic or analytical data increases. At the C level, questions rarely, if ever, concern the operational; C-level information needs center on corporate financial performance displayed and analyzed from multiple points of view, or on inventory management and optimization based on any number of criteria.

Both of these disparate informational needs must be satisfied by the company’s technology infrastructure and various online and offline data repositories. Even more important than the business community’s lack of understanding about the distinctions between operational and strategic informational needs is the lack of understanding between the business community and IT support staff. The gap can never be fully bridged unless all parties take the time to jointly assess the current state of the company’s strategic direction, business processes, and technology support structure.

In order to successfully bridge the gap between these two islands of information requirements, it is imperative that one has a complete understanding of the company’s corporate information systems and that the organization performs a thorough analysis of the current and planned information systems.

Corporate Information Systems

Every company comprises many silos of data. Many of these silos house similar, if not completely redundant, data. Most commonly, each information system was designed and implemented to support a specific tactical or operational need. Companies build or deploy order entry systems to place and manage their customer orders and requests. Companies build or deploy customer service systems to support their existing customer base with new products, services, or other contact needs. Companies build or deploy sales automation systems to support their sales force and/or maintain their potential customer information. Alternatively, companies may deploy CRM solutions to satisfy many or all of these needs. Companies must also manage their financial and regulatory offices by building or deploying any combination of G/L, accounting, ERP, HR, inventory management, supply chain, material requirement planning (MRP), or other custom-developed tools and systems.

It is obvious how data collection and storage mechanisms can grow quickly. These tools and systems all require similar core data to function while also gathering significantly broad, diverse, and profuse amounts of other pieces of data. Even with their intrinsic ability to collect and display huge amounts of data, each of these types of applications is built to run day-to-day organizational operations. Sales cannot be fulfilled without functioning order entry, prospect management/CRM, or inventory applications. Factory output cannot be accurately projected and staffed without functioning order entry, inventory, supply chain, or MRP systems. Payroll and operating budgets cannot be projected and/or processed without functioning HR, G/L, accounting, or ERP applications. In short, a company requires each of these applications to open its doors in the morning.

On the other hand, management does not run a business in the same way that the business operates. Executives must look at information from several disparate operational systems and derive a view of the company’s state in order to build strategic plans. To accomplish this, companies deploy analytic applications, such as data warehouses, BI reporting tools, OLAP cubes, dashboards, and scorecards. Because these applications are used to define and drive the company’s strategic direction, they require significantly different design and implementation tactics than do operational systems.

Analysis of Corporate Information Systems

Regardless of the purpose of the information application — operational or analytical — no system can live in a vacuum. While operational systems are designed solely to support limited sets of critical day-to-day corporate functions, they routinely need to share key pieces of data. Operational systems will also feed analytical applications. In order to accomplish this sharing of information, the organization should perform a complete analysis of all its information systems. In addition, the strategic analysis of the organization’s data infrastructure should include an identification and assessment of potential future requirements. For example, a company may not currently use any CRM application, but growth in business may necessitate the systematization of sales and prospect management activities. In doing this assessment, a strategic plan of the organization’s current and future information needs is developed. This plan will also include a high-level identification and assessment of any known future requirements. It becomes a roadmap that may be used when deploying new applications in the organization or when upgrading or enhancing existing portions of the infrastructure.

The Strategic Plan

As we noted previously, the strategic planning process involves taking a close look at all existing pieces of the corporate data infrastructure as well as identifying and prioritizing potential future needs. When planning to deploy any new information resource, or to add components to an existing solution, time should always be set aside to ensure proper planning. The strategic planning effort helps ensure a smooth deployment and transition to new capabilities, leads naturally to a consistent integration architecture and process, and minimizes redundant work. As with all information initiatives, the process is iterative, and in some instances portions of a current process are reworked in later phases. Identifying these redundant efforts up front allows integration planners to group requirements so that tasks can be deployed without reinventing the wheel in every release.

The strategic plan should include the following components:

  • A technology architecture review
  • An assessment of all existing systems and applications
  • An identification and description of the desired future state of applications and infrastructure (including business requirements and hardware upgrades)

Technology Architecture Review

The current-state architecture assessment comprises several components:

  • A graphic rendering of current-state topology components (a topology map)
  • A topology map of planned future-state topology components
  • A detailed definition of all systems platforms

Current and Future Topology Maps

Figures 5 and 6 show samples of current- and future-state topology maps. A topology map displays all current technology components and how they are connected to one another within the technology centers. Note that these maps do not identify interoperability or application-level dependencies. This information is included in a subsequent section of the strategic plan.

Figure 5 — A sample current-state topology map.
Figure 6 — A sample future-state topology map.

The topology map depicts all servers, client machines, network devices, and communications or network links. The hardware should be identified by its system name and, space permitting, a brief summary of hardware, operating system, and database engine. For example, a server in the topology map could be identified as “ServerName: Sun v440, Sol8, Oracle 9i,” rather than simply as “Server.”

Communications links should specify their bandwidth capacities and, if possible, utilization percentages. This information is useful when planning for any capacity increases necessitated by new applications to be developed.
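To keep a topology map maintainable, the same node and link details can be held in machine-readable form and used to generate diagram labels. The Python sketch below is purely illustrative; the server names, hardware specs, and link figures are invented to demonstrate the "ServerName: hardware, OS, database engine" labeling convention described above.

```python
# Hypothetical topology inventory; all names and specs are invented examples.
topology = {
    "ORDENTSrv": {"hardware": "Sun v440", "os": "Solaris 8", "db": "Oracle 9i"},
    "CRMSrv":    {"hardware": "Sun v240", "os": "Solaris 8", "db": "Oracle 9i"},
}

# Links carry bandwidth capacity and, where known, utilization percentages,
# which supports the capacity planning discussed above.
links = [
    {"from": "ORDENTSrv", "to": "CRMSrv",
     "bandwidth_mbps": 100, "utilization_pct": 35},
]

def node_label(name: str) -> str:
    """Render the label convention suggested in the text:
    'ServerName: hardware, OS, database engine'."""
    spec = topology[name]
    return f"{name}: {spec['hardware']}, {spec['os']}, {spec['db']}"

print(node_label("ORDENTSrv"))  # ORDENTSrv: Sun v440, Solaris 8, Oracle 9i
```

Keeping the inventory as data rather than as a static drawing means the map can be regenerated whenever a server or link changes.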

Platform Definitions

Every component in a topology map should be completely defined. This definition includes the hardware profile, the operating system level (with all patches identified), the complete software profile, and user or process load information. For example, the order entry application server may be defined as follows:

  • ServerName: ORDENTSrv
  • Hardware: Sun v440
  • 4 x UltraSPARC IIIi
  • 8 GB RAM
  • 150 GB disk
  • OS: Solaris 8, Revision 7/03
  • Software: Oracle 9i
  • Sun Java Application Server Platform 8
  • User profile: one database administrator/system administrator
  • Three power users/app server administrators
  • 16 sales users
  • Process load: system is generally 40% utilized at any time
  • Peak utilization times are at the start and end of the business workweek and rarely exceed 60% utilization
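A platform definition like the one above lends itself to a simple structured record. The following Python sketch models the ORDENTSrv example; the field names are our own invention, not part of the report's template.

```python
from dataclasses import dataclass, field

# Hypothetical record type for the platform definitions described above.
@dataclass
class PlatformDefinition:
    server_name: str
    hardware: str
    cpus: str
    ram_gb: int
    disk_gb: int
    os: str
    software: list = field(default_factory=list)
    users: dict = field(default_factory=dict)   # role -> head count
    avg_utilization_pct: int = 0
    peak_utilization_pct: int = 0

ordent = PlatformDefinition(
    server_name="ORDENTSrv",
    hardware="Sun v440",
    cpus="4 x UltraSPARC IIIi",
    ram_gb=8,
    disk_gb=150,
    os="Solaris 8, Revision 7/03",
    software=["Oracle 9i", "Sun Java Application Server Platform 8"],
    users={"DBA/sysadmin": 1, "power users/app server admins": 3,
           "sales users": 16},
    avg_utilization_pct=40,
    peak_utilization_pct=60,
)

# Total named users, useful for the load-planning discussion above.
print(sum(ordent.users.values()))  # 20
```

Holding definitions in one structure makes it straightforward to report across all platforms, for instance to find every server still on a given OS revision.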

Application and System Assessment

Key to the planning effort is the assessment of applications and systems, which also tends to be the most time-consuming part. Many organizations, especially those with larger and more established IT shops, run many applications and systems. Some of these applications have been in production for years, and the number of IT personnel who fully understand all parts of each application has decreased over time. Therefore, extra time and care should be taken to ensure that the applications are fully documented. The application assessment also includes a mapping of the data interdependencies between all systems and applications. Further, all offline sources of information that can be identified, such as Excel spreadsheets, should be incorporated into the application and system assessment.

The application assessment includes the following components:

  • A data interdependency map (i.e., data flow diagram)
  • A data interdependency analysis
  • An application summary

Data Interdependency Map

All corporate technology assets inevitably have interdependencies. As the order processing application receives information about a new customer, the servicing application must acquire this same information to ensure that the service representatives have the appropriate knowledge to deal with any call for support. In order to ensure that all dependencies are documented and understood, the strategic plan should include a data interdependency map, otherwise known as a data flow diagram. Figure 7 shows a sample data interdependency map.

Figure 7 — A sample data interdependency map.

Application Summary

The final piece of the application and system assessment is the application summary. This summary takes an in-depth look at each source of data that the organization currently uses. Historically, these data sources were the operational “systems of record.” More recently, the need has arisen to incorporate, in a structured and systematic manner, the various “personal silos” of data: Excel spreadsheets, Access databases, and other information stored in offline resources. These personal silos should be considered when performing the application summary even if they are not currently used to feed data directly into an operational system. For example, the spreadsheet that the order entry clerk uses to manually look up valid values when entering data into the system is a vital piece of knowledge that should be captured at this time. Just because the system does not hold the decode map in its database does not mean that the information has no value.

This analysis includes the following items:

  • The application name
  • A description
  • The hardware platform
  • The software platform
  • Business requirements
  • A high-level data model
  • A high-level data dictionary
  • Hierarchy information
  • Technical implementation parameters

Table 1 clarifies the contents and purpose of each of these items.

Table 1 — An Application Summary

Application name. This is the common business user name for the application: ORDENT
Description. This is a brief description of the application’s business purpose: The ORDENT system is the company’s primary order entry application. All product orders are fed into this application prior to fulfillment.
Hardware. This is the name of the server on which this application resides: ORDENTSrv
Software. This is the major server software employed in the development of this application: Oracle 9i database; Java front end
Business requirements. This is a list of business users’ requirements for the application:

  • Must allow orders to be entered in real time by the order entry users while a representative is engaged with the customer through the call center
  • Must allow batch entry of orders by order entry clerks from order sheets faxed in from the field sales force
  • Must provide a fulfillment tracking mechanism via reports or ad hoc query
  • Must provide estimated time to deliver information to order entry user, who may provide this information to the customer
High-level data model. Figure 8 shows a top-level data model (entity-relationship diagram) for the application; only the first layer of the model is shown. The purpose of this model is not to document the existing application, but rather to provide a single point of reference for all data sources in the organization in the context of their interrelationships. It is assumed that the IT personnel already have complete documentation for the system; if this assumption proves false, and reverse-engineering the existing application is required or desired, then that analysis activity should be scheduled into the planning effort. In any case, only the top level of the model is included in this summary.
High-level data dictionary. The dictionary defines the major data components of the application. As with the data model, the data dictionary is not meant to be a complete documentation of all data elements in the application. This dictionary is meant to be a reference point for those key elements that directly support the business requirements. Also, any elements that are used to group the data or provide linking mechanisms to other applications or data sources in the organization should be included in this section. The high-level data dictionary includes the following information about each major data element:

  • Data element name (logical name)
  • Business description (a descriptive definition of the element and its use in the business process)
  • Data element location (physical table and column names)
  • Source application (name of application, data feed, or “personal data silo”)
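These four attributes map naturally onto a small structured record per element. The Python sketch below uses invented element, table, and column names to illustrate how such a dictionary might be kept and queried; none of them come from the report.

```python
# Hypothetical high-level data dictionary entries, one per major element,
# following the four attributes listed above. Names are invented examples.
data_dictionary = [
    {
        "element": "Customer Name",
        "description": "Legal name of the ordering customer; used to link "
                       "orders to the servicing application.",
        "location": "CUSTOMER.CUST_NAME",   # physical table and column
        "source": "ORDENT",
    },
    {
        "element": "Order Total",
        "description": "Total value of a single order prior to fulfillment.",
        "location": "ORDERS.ORD_TOTAL",
        "source": "ORDENT",
    },
]

# Quick lookup of where an element physically lives.
locations = {e["element"]: e["location"] for e in data_dictionary}
print(locations["Order Total"])  # ORDERS.ORD_TOTAL
```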
Hierarchy information. The following defines the hierarchy reference data for the application. In addition to the data dictionary, data relationships are also maintained in what are termed functional hierarchies. Some of the more common hierarchies are G/L, legal entity, product, geographic, and organizational. Examples of these reference data hierarchies are as follows:

G/L

  • Account
    • Subaccount
      • Cost center
      • Cost center
      • Cost center
    • Subaccount
      • Cost center

Legal entity

  • Family insurance company
    • Mom’s insurance company
      • Mom’s life
      • Mom’s health
    • Dad’s insurance company
      • Dad’s property and casualty


Product

  • Automobile
    • Sports car line
      • Convertible
      • Coupe
    • Family car line
      • Sedan
      • Station wagon


Geographic

  • USA
    • Pennsylvania
      • Chester County
      • Berks County
    • New Jersey
      • Camden County
        • Cherry Hill
        • Camden


Organizational

  • CEO
    • CFO
      • Controller
      • Auditor
    • COO
      • EVP
        • VP
          • Manager
            • Worker bee

It should be noted that the hierarchies are logical views of the differing business processes and components as viewed and used by business users. The implementation team defines the actual physical implementation of each hierarchy during the application’s development and deployment. These hierarchies are also called “reference data” because they provide a common context within which data from all sources may be logically organized and shared.
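Because a hierarchy is simply a tree of reference data, it can be represented and traversed generically, whatever physical form the implementation team chooses. The Python sketch below holds a G/L-style hierarchy (with placeholder account and cost-center names) and enumerates each leaf with its full rollup path.

```python
# A hierarchy such as the G/L example above, held as a nested mapping.
# The account and cost-center names are placeholders, not real reference data.
gl_hierarchy = {
    "Account": {
        "Subaccount A": ["Cost center 1", "Cost center 2", "Cost center 3"],
        "Subaccount B": ["Cost center 4"],
    }
}

def rollup_paths(tree, prefix=()):
    """Yield every leaf with its full path: the common context used to
    organize and roll up data from all sources."""
    for node, children in tree.items():
        if isinstance(children, dict):
            yield from rollup_paths(children, prefix + (node,))
        else:
            for leaf in children:
                yield prefix + (node, leaf)

for path in rollup_paths(gl_hierarchy):
    print(" > ".join(path))
```

The same traversal works for legal entity, product, geographic, or organizational trees, which is exactly why hierarchies are worth capturing once as shared reference data.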

Technical implementation parameters. This section lists the primary technical requirements that were developed to satisfy the business requirements stated above. The ORDENT order entry system was developed to support the company’s sales organization and to satisfy the stated business objectives. In order to accomplish these goals, the following technical requirements were identified and implemented:

  • A Java application server platform was deployed to support the order entry front-end application.
  • Thin-client screens were used for all order entry and reporting/query functions.
  • An Oracle 9i database was designed and implemented, containing the requisite tables and columns to support the order entry process.
  • A batch load facility was created that takes a spreadsheet (.xls) file as input and populates all required order data, supporting the faxed-in sales order business requirement.
  • An external call to the inventory management application was developed to allow the order entry user to query in real time the current inventory levels for a product from within the order entry application.
  • An estimated-time-to-deliver capability was provided so that the order entry user may pass this information along to the customer.

Again, the application assessment is not meant to document previous IT implementation projects, but rather to provide a high-level overview of each application running in the corporate information environment.

At the conclusion of these business and technology reviews, the foundation layers that bridge the gap of understanding between the business and technology communities have been established. By using these analysis documents, the organization is poised to build on a solid foundational understanding to drive the implementation of appropriate and abundantly useful technology to support the dynamic nature of today’s business climate.


Companies of varying sizes have attempted to support analytical procedures and applications internally, which has led to many implementation approaches. All but a few have been plagued by false starts, missteps, or outright failures. The nature of analytical applications requires planning and implementation measures and techniques that are significantly more involved, time-intensive, and complex than the relatively simple deployment of an inhouse-developed business application. At a minimum, the analytical infrastructure comprises all components that support the strategic decision-making needs of the organization. These components include, but are not limited to, the following:

  • Enterprise data repositories, such as data warehouses and data marts
  • Analytical applications (i.e., OLAP), such as Hyperion Essbase and Cognos PowerPlay
  • Data mining and visualization tools, such as IBM Intelligent Miner, Visipoint, and Quadstone Decisionhouse
  • Reporting tools, such as Business Objects and Hyperion Brio

As can be seen from this admittedly small list of analytical tools, many players are emerging in a rapidly growing industry to support the analytical needs of business. When planning an organization’s integration framework, each kind of analysis must be considered; otherwise, the underlying design of the deployed platform is likely to be inadequate, limiting the data available to any number of these applications.

Every IT shop in the world has some level of integration activity in place. In many cases, the sum total of past experience is in making finite sets of data from one legacy application available for use by another. Unfortunately, people tend to remain within their comfort zone; that is, they rely on the integration techniques that they have already deployed to satisfy the requirements imposed by the analytical applications platforms. This is where the trouble begins.

So how does an organization plan for the analytical portion of the integration framework? First, gain a thorough understanding of the business requirements, which dictate the need for particular analytical applications. Second, research each component of the analytical platform, since each application will likely impose informational or structural requirements on the technology integration effort.

Understand Business Needs

Time must be set aside to gather, analyze, and understand business users’ requirements in terms of reporting and analytical capabilities. All these requirements are logically clustered into like groups: reporting, visualization, mining, OLAP, and so on. These groups of requirements identify which application components are necessary to comprise the analytical platform. Most platform implementations will require only a limited subset of the possible analysis applications.

Figure 8 — A sample top-level data model.

Another point to consider is the scope of the analytical platform required to support the business. Many companies will start with a small, limited analytical implementation. This will be installed as a departmental solution, consisting of one or two applications. In most of these limited projects, the intelligence application will be fed directly from data contained in spreadsheets, desktop databases, or small extracts from a single operational system, such as the G/L. While this satisfies the company’s needs today, it does not consider the future needs of the growing business. As a result, the integration platform and mechanisms put in place will likely not be scalable enough to grow alongside the company.

The organization that takes the longer-term view of the analytical needs of the company will inevitably be better positioned to succeed even if the IT group initially deploys only a subset of the total platform. By taking the long-range view, the development group is able to identify all potential data interactions, all integration requirements, and integration infrastructure components up front. This allows the design of a coordinated development and deployment plan for the integration framework.

This long-term approach is also the only path to overall success in terms of consistent data integration capabilities. An organization cannot thrive on a succession of tactical, departmental analysis and integration projects; the departmental approach merely extends the very systems stovepipes that the integration framework is designed to eliminate. Large and/or forward-thinking organizations will begin the design of their analytical platforms, and by extension their integration frameworks, with the business requirements that drive the technical implementation.

Understand the Analytical Platform

The previous section discussed how and why the analytical platform’s business requirements should be used in the planning phases of the integration framework. Once the tools have been identified, the next step is the design of the data repository portions of the analytical platform. The business requirements lead directly to the organizational data that must be included in the data warehouse. Further, the data modeler will use the interactions among the business requirements to design the most effective data structures within the warehouse.

Once the data interactions are understood and the basic structures have been defined, it is necessary to take a closer look at the other components of the analytical platform. Many of the tools on the market strongly recommend that certain data structures be built into the data repository in order to ensure optimal performance. In some cases, these recommendations are stronger than others. During the initial design of the warehouse data model, these recommendations must be known should the organization opt to use a product that actually mandates specific structures for data aggregation or relationship management.

One last point to consider is that not all tools work and play well together. Selecting the best server, the best database, and the best analytical products without regard for their interaction will not yield the best results. Some products are parts of complete suites that work effectively only with other products in the suite. Further, the supporting data structures required for one product may conflict with those of another.


As this report indicates, maintaining an infrastructure that enables IT to provide for all the needs of an organization is not trivial. It is difficult for many organizations to build such structures initially, and it is difficult to create structures with the headroom to grow and accommodate changing demands. The key is to understand that any IT organization can maintain “data”; storing transactions alone does not provide the critical capability of providing “information.” Here is the key distinction: data is the result of underlying transactional events, the record in its purest form, stored as an electronic transaction; information is the result of relating various pieces of data to create usable facts. A record of sales to a customer during a time frame is a data point. Taking a series of data points for a customer and trending sales over time yields information: Customer X may buy more of a product during the summer than during the winter. If that trend holds true for a group of customers, it can be used to drive production and delivery decisions.
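The data-versus-information distinction can be made concrete in a few lines of code. The Python sketch below uses invented sales figures to show how individual data points are related into a seasonal trend for Customer X.

```python
# Illustrative data points: (customer, month, units sold).
# The figures are invented to show how raw data becomes information.
sales = [
    ("Customer X", 1, 40), ("Customer X", 6, 90), ("Customer X", 7, 95),
    ("Customer X", 12, 35), ("Customer Y", 6, 20), ("Customer Y", 12, 22),
]

def seasonal_totals(rows, customer):
    """Relate individual data points into a summer/winter trend."""
    summer = sum(u for c, m, u in rows if c == customer and m in (6, 7, 8))
    winter = sum(u for c, m, u in rows if c == customer and m in (12, 1, 2))
    return summer, winter

summer, winter = seasonal_totals(sales, "Customer X")
print(summer > winter)  # True: Customer X buys more in summer than in winter
```

Each tuple on its own is just a data point; the seasonal comparison is the information that could drive production and delivery decisions.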

By extension, because the typical enterprise has collections of data points in a variety of unrelated systems, IT can provide a mechanism to integrate these disparate data points, and it is this integration that drives the quality of the resulting information. Integrating this array of disparate data is critical because it yields the information needed to drive strategic business decisions. Organizations must invest in this infrastructure and capability because the quality of the information determines the organization’s ability to respond to key tactical and strategic questions. Figure 9 depicts the framework required to create a successful integration infrastructure.

Figure 9 — The integration framework.

The infrastructure enables support of an environment that makes accessible all the operational data and any required external data that must be stored. The integration layer must be dynamic, flexible, and extensible. A typical environment may contain a data warehouse, perhaps some data marts, a hierarchy management application, and a metadata repository. Properly provisioned, such a dynamic environment will interact easily with the reporting and compliance layer to produce required analysis cubes, executive information systems, planning information, regulatory reporting, statistical reporting, and information for setting corporate objectives. To date, the emphasis has been on building these layers to accommodate structured organizational data. In the future, it will be necessary to incorporate the unstructured data contained in an organization’s back-office systems; compliance with SOX, HIPAA, and Basel II will force this incorporation.

Only when the requirements for the organization are understood should you begin to prepare the integration framework. Designing the integration framework should follow much the same path as designing a data warehouse:

  1. Gather and validate the business needs.
  2. Identify potential integration methodologies (e.g., custom code; extraction, transformation, and loading [ETL]; enterprise application integration [EAI]; enterprise information integration [EII]).
  3. Evaluate candidate tools.
  4. Deploy the selected tools.
  5. Provide access to the framework to the developers who will implement the data integration processes.

The nature of your corporate IT culture will play heavily into which methodology or methodologies are suitable for your organization. The organizations with many inhouse-developed legacy applications will most likely lean toward implementing any integration task through the use of custom code or possibly the use of ETL products. On the other end of the spectrum, those organizations that rely heavily on the use of newer, packaged applications have many more options available to them, such as the application interfaces of EAI or the virtual data repositories made possible with EII.

In order to effectively plan the integration environment, designers must have a solid understanding of the various methodologies. Table 2 features each major integration method as well as its pros and cons.

Table 2 — Integration Methods

Custom-coded integration processes. As the name implies, this method involves IT programmers writing custom code to perform all the data extraction from the source systems, the transformation logic, and the insertion of data into the target applications. This method is rarely used now and should be weighed carefully before being chosen as the way to integrate data.
  • Method enables integration of data from legacy applications that are not supported by any other technology.
  • Method requires significant development time from key IT support resources.
  • Any modification to source or target systems necessitates a complete code review and modification or redevelopment effort.
  • Chances for coder to introduce errors are much greater than for other technologies.
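A minimal illustration of the custom-coded approach, assuming a hypothetical CSV export with invented field names. Real custom code would also handle errors, logging, and restartability, which is where the maintenance burden described above grows.

```python
import csv
import io

# A sketch of one custom-coded integration step: extract rows from a source
# export, transform them by hand, and stage them for the target system.
# The export layout and field names are invented for illustration.
source_export = "cust_id,amount\n101,250.00\n102,99.50\n"

def extract(text):
    """Parse the source system's export into row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Hand-written transformation logic: any change to the source layout
    # forces a code review and rework of functions like this one.
    return [{"customer_key": int(r["cust_id"]),
             "amount_cents": int(float(r["amount"]) * 100)}
            for r in rows]

staged = transform(extract(source_export))
print(staged[0])  # {'customer_key': 101, 'amount_cents': 25000}
```

Even this toy version shows the coupling problem: the extraction, the transformation, and the staging format are all welded to one source and one target.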
Extraction, transformation, and loading (ETL). ETL tools were one of the IT world’s major strides forward in that they essentially eliminated the need for programmers to write custom programs to perform the basic actions of moving and combining data between systems. Traditionally, ETL processes are batch oriented, with the goal of pulling large amounts of data from various source systems and databases on a set cycle. In the data warehouse space, this cycle is generally weekly or monthly; however, it is becoming increasingly common to see daily ETL processes developed. Unlike custom-coded processes, ETL tools provide the ability to perform almost any required transformation, aggregation, derivation, or merge/purge task through the use of provided modules that are linked together in a logical integration workflow. Even though the ETL tools have matured and support a wide variety of data manipulation tasks out of the box, a time may come when a transformation object must be customized. In these few cases, ETL tools will support the inclusion of custom code to be compiled and called from inside the ETL workflow. The end product of the ETL tools is a set of load files that may be inserted into the target environments via the databases’ native bulk load facilities. Alternatively, the higher-end ETL products will allow a stream of data to be fed into the database as part of the transformation process without the need for the final data loading steps.
  • High-performance transformation engine is provided.
  • Many data manipulation modules are available from tool providers.
  • Method requires little coding effort from the IT programmer staff.
  • ETL processes are easily updated as changes to source and target environments occur.
  • Method is primarily batch oriented.
  • Processes incur too much overhead to be used for real-time or near-real-time integration.
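The ETL workflow idea can be sketched as reusable transformation modules chained together. The module names and sample rows below are illustrative, not taken from any particular ETL product.

```python
# Generic, reusable transformation 'modules' linked into a logical workflow,
# in place of bespoke code per feed. All names and rows are invented.
def merge_purge(rows):
    """Drop duplicate keys, keeping the last-seen record."""
    return list({r["id"]: r for r in rows}.values())

def derive(rows):
    """Add a derived column, a typical out-of-the-box transformation."""
    return [{**r, "total": r["qty"] * r["price"]} for r in rows]

def workflow(rows, steps):
    """Run the batch through the linked modules in order."""
    for step in steps:
        rows = step(rows)
    return rows

batch = [
    {"id": 1, "qty": 2, "price": 10.0},
    {"id": 1, "qty": 3, "price": 10.0},  # duplicate key: purged
    {"id": 2, "qty": 1, "price": 5.0},
]
load_file = workflow(batch, [merge_purge, derive])
print(len(load_file))  # 2 rows ready for the bulk load step
```

The point of the pattern is that `merge_purge` and `derive` are written once and reused across feeds; only the workflow composition changes per integration.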
Enterprise application integration (EAI). EAI is quite similar to ETL in that it is a primarily batch-oriented method of sharing data between various data sources and applications. However, EAI is implemented as a middleware solution that interfaces with the source and target applications at the user interface and/or API level. For example, an EAI process could be set up that will retrieve the company’s chart of accounts and a limited subset of financial data from the SAP application by issuing a set of client interface calls that is processed by SAP just as though a person issued the request via the user interface. As with ETL products, EAI tools also provide a set of manipulation tools; however, these modules are not as robust. EAI is designed to provide a scriptable and automatable way to share limited amounts of data between applications. It is not designed to fulfill the larger integration needs of data warehouses or full-blown enterprise reporting and analysis.
  • Method provides data transformation engine.
  • Method requires little or no coding effort from IT programmer staff.
  • EAI processes are easily updated as changes to source and target environments occur.
  • Method allows for communication with sources and targets with end-user-understandable methods: user interface calls and API functions.
  • Method is primarily batch oriented.
  • Processes incur too much overhead to be used for real-time or near-real-time integration.
  • Method isn’t as robust or high performing as ETL tools; EAI is more of a departmental or ad hoc integration solution
Enterprise information integration (EII). As the name implies, EII is designed to provide information directly to users as they need it. It has been designed from the outset to provide access to information from multiple systems and to integrate and consolidate data elements in real time, at the moment the user needs the information. EII supports most of the major reporting, analytical, and ad hoc query products on the market. As the user issues the request for information, the EII server identifies the data necessary to compile, gathers the data from the disparate source databases and applications, transforms and aggregates the data accordingly, and presents the results to the user as though it had been a simple report or query against a single data warehouse. With EII there is no need to physically combine all the data into one repository, nor any need for the user to know or care where the data is actually stored.
  • High-performance, real-time data manipulation engine is provided.
  • Data resides in its native repository until needed.
  • Query results are commonly quite fast.
  • Process requires little or no coding effort.
  • EII data mappings are easily updated as changes to source and target environments occur.
  • Process requires sophisticated indexing and caching technologies.
  • Server hardware requirements are higher than with the other integration methods.
  • Support staff must have a specialized skill set to build and maintain the data maps and caches.
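The request-time join that distinguishes EII can be sketched in a few lines. The two in-memory "databases" below stand in for physically separate sources; the table contents and field names are invented.

```python
# Two stand-ins for physically separate sources: an order system and a CRM.
orders_db = {101: {"customer": "X", "amount": 250.0}}
crm_db = {"X": {"region": "Northeast"}}

def federated_order_report(order_id):
    """Gather data from both sources at request time, join them, and present
    the result as though it were one simple query; no consolidated store."""
    order = orders_db[order_id]            # source 1
    customer = crm_db[order["customer"]]   # source 2, joined in real time
    return {"order_id": order_id,
            "amount": order["amount"],
            "region": customer["region"]}

print(federated_order_report(101))
```

The data never leaves its native repositories until the moment of the query, which is precisely why EII needs the indexing, caching, and mapping sophistication listed above.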

The key to creating a successful environment is selecting complementary toolsets that can be placed in a well-thought-out, well-planned infrastructure. Creating such a model takes time, diligence, and intensive planning and information gathering. Organizations often try to skimp on gathering requirements, but failing to undertake a full needs-and-organizational-requirements analysis is much like an expectant couple, eagerly awaiting the birth of their firstborn, responding to the need for better transportation by buying a two-seater sports car. Such a solution may work in the short term but will be quickly outgrown once the child arrives. Shortsighted solutions that prove incapable of growing with the organization become more of a headache to maintain and support than their initial benefits justify. Even in the best-laid-out environments, the gap between what is in place and what is being put in place mandates that an organization be able to respond quickly and without hesitation. The pain of that response will be inversely proportional to the success of the initial requirements gathering process and to how good a job is done at ferreting out what is likely to come up in the near term (three to six months) and the long term (one or more years).


Today, the business world is fast-paced and business climates change rapidly, so business leaders must have immediate access to key business factors and performance metrics. This requires the capability not only to use the BI/BPM tools to see information in executive information systems (EIS), but also to view the detailed transactional data upon which the EIS information is based. Oftentimes, only by having access to the underlying detail are trends readily visible. In order to react quickly, the information must be organized effectively and presented efficiently. And to provide strategic value, these metrics must be presented from the viewpoints of different groups within the business. In order to accomplish this, the underlying detail must also be available within the context of users’ analysis.

With the advent of new regulatory requirements, newer, automated data analysis tools have taken a dominant position in business users’ arsenal for providing vital regulatory and statutory reporting information. The mandatory nature of regulatory compliance demands a higher level of accountability and a far more foolproof and repeatable set of business processes that, supported by the new BI/BPM applications, provide the required level of analysis and detail for reporting.

New trends in the business world mandated by SOX strengthen the need for new sources of disparate data. Whereas businesses were previously making decisions and building infrastructures only to support the use and analysis of structured data, new requirements now force the business community to deal with the world of unstructured back-office data as well. Today, a business must be fully accountable for all communications from and within the organization, and this massive amount of data can no longer be ignored. Future trends will force this volume of unstructured data to be unlocked and merged so that all the appropriate data is available to the entire enterprise. Management can no longer sit back and assume that regulations are followed; it must now guarantee in writing that this is the case. This new organizational requirement forces enhancements and growth to current infrastructures as IT struggles for ways to integrate the structured and unstructured worlds to meet these new demands.

Since it falls to the IT organization to satisfy management's need for this new level of information and to create an infrastructure with the capabilities to support these new tools, IT must move forward. In order for the business to grow and thrive, IT must succeed. In order for IT to succeed, technology architects must take a step back and be proactive rather than reactive. Increasingly, software vendors are introducing themselves directly to the business community, and their technologies include capabilities that allow business users to accomplish some of the "traditional" IT workload of integrating data. As a result, business users tend to think that if these tools facilitate their jobs, it should be simple for an IT group to support the new technology. But this isn't always the case, and IT must be prepared to explain why. The requirements under the covers for these new applications may not necessarily play nicely with the current organizational infrastructure, and the cost of marrying new technologies with antiquated infrastructures is extremely high.

Technology people are notoriously conservative when it comes to considering new ways of accomplishing their jobs. When ETL tools first hit the market, for example, it was business sponsors — not the IT community — who first supported IT efforts for the decision support projects that drove IT away from developing custom code and toward performing integration with new products. It was a business decision, citing time to market and reduced staffing requirements, that pushed IT management to adopt this new technology. Over the past decade, the integration space has not rested on its laurels. Now IT faces demands to evaluate, deploy, and/or support any combination of legacy custom integration code, ETL processes, EAI, and EII services. Given the various direct-access tools that many of the BI/BPM products provide, a coordinated integration framework is the only commonsense approach for an organization's CIO.

IT management must be willing to accept some short-term criticism from business users in order to buy the necessary time to plan for the ultimate success of all. By taking the time to understand the nature of the business, the needs of users, the current and planned future technology architectures, and the hardware and software technologies currently in use and in development, IT planners can design an infrastructure that meets all of today's needs while remaining flexible enough to support the anticipated, and unanticipated, growth of the company. This integration framework allows all data-generating and data-gathering points in the company — regardless of their location or the types of data they store — to communicate, share, and evolve into an integrated information factory that drives rather than hinders corporate success.

Implementing BI in the Cloud

Cloud computing offers a lot of promise. By virtualizing hardware and software infrastructure and paying a third party to deliver services as you go on a subscription or usage basis, companies can save a lot of money and time, and speed the deployment of business solutions.

Initially, cloud-based solutions were designed for small- to mid-size companies that didn’t have available IT resources or capital to spend on creating and managing a software and hardware infrastructure. Today, many large companies are investigating the cloud as a way to add new business solutions quickly and augment existing data center capacity. But cloud computing isn’t for everyone, especially in the BI space.

Types of Cloud Offerings

To understand what makes sense to deploy in the cloud, you first have to fathom what the cloud does. In essence, the cloud abstracts underlying services and is a common metaphor for the Internet, which routes data dynamically across a global network based on capacity and other factors. Today’s cloud delivers three levels of services that together comprise a solutions stack: applications, platforms, and infrastructure services. (See figure 1.) 

SaaS. Application services are typically called software-as-a-service (SaaS). Salesforce.com, which was founded in 1999 to deliver sales solutions online to small- and mid-sized companies, popularized the notion of SaaS; it now boasts 1.1 million subscribers and has spawned lots of imitators. With SaaS, employees use a browser to access an online application, which is housed and managed by the SaaS provider. There is no hardware to configure or software to install and no licenses to purchase. You simply pay a monthly subscription fee for each user, and you're up and running.

IaaS and PaaS. In the past several years, the cloud industry has grown, spawning two more services: platform as a service (PaaS) and infrastructure as a service (IaaS). Amazon popularized the latter with the advent of its EC2 cloud computing offering, which lets IT administrators dynamically provision servers in Amazon’s data center and pay according to usage. Many IT administrators now use IaaS as a convenient, low-cost way to maintain development, test, or prototyping environments or to support analytic sandboxes that have a short lifespan. PaaS is the newest addition to the cloud family, allowing developers to build and deploy custom cloud-based applications and solutions. Many PaaS customers are ISVs that want to create cloud-based offerings or enhance them with complementary applications, such as reporting or analysis.

BI in the Cloud

From a BI perspective, all three incarnations of the cloud offer interesting possibilities, but they come with constraints. For instance, SaaS offerings are essentially packaged analytic applications. Like their on-premises brethren, SaaS offerings need to be highly tailored to an application domain so the solution fits customer requirements like a glove and doesn't require endless and unprofitable rounds of customization. And it doesn't do much good if the SaaS vendor supports only one application out of several, because the customer will then end up with a mix of on-premises and hosted solutions that are difficult to integrate. So, unless the SaaS vendor supports a broad range of integrated functional applications, it's hard to justify purchasing any SaaS application.

Data Transfers. Another constraint is that all three types of BI cloud offerings need to transfer data from an internal data center to the cloud. Most BI solutions query a data warehouse or data mart that is continuously loaded from operational systems residing in the company's data center. Currently, moving large volumes of data to the cloud on a regular basis over the public internet injects latency and complexity into the load process and can become expensive, since cloud providers charge fees for data transfers and data storage. In addition, if users query cloud-based data marts with BI tools that run on in-house servers, their queries and result sets also travel across the internet, adding more latency and cost.

Given this constraint, BI cloud-based solutions are ideal in the following situations:

  1. Small companies that don’t have a lot of data.
  2. Applications that don’t require lots of data, such as development and test environments or small data marts that can be updated quickly.
  3. Applications in which all source data already exists in the cloud (e.g., a start-up company that runs its entire business in the cloud).
  4. Ad hoc analyses that require one-time import of data from one or more sources. The cloud is proving an ideal way to support data-hungry analysts.
  5. Report sharing.

Data security. Data security is another constraint, but one that is largely illusory. Companies are reluctant to move data outside the corporate firewall for fear that it might get lost or stolen. Highly publicized data thefts in recent years certainly feed this sentiment. But the fear is largely irrational. Most companies already outsource sensitive data to third-party processors, including payroll (e.g., ADP) and customer and sales data (e.g., Salesforce). And when IT administrators examine the data center and application-level security supported by cloud vendors, most will concede that the data is probably more secure in those data centers than in their own! Most new technologies encounter the same criticisms: for example, many thought e-commerce would lead to widespread fraud when it first became available in the late 1990s.

Due Diligence. Nonetheless, companies looking to "outsource" applications, platforms, or infrastructure to the cloud should investigate the cloud provider's operations to ensure that it can meet your service-level agreements for security, availability, reliability, scalability, and performance. For instance, what are the provider's failover and backup procedures? Does it have a disaster recovery plan? Does it comply with SAS 70 data center security protection guidelines?

In addition, you should carefully analyze pricing policies and total cost of ownership. Does the SaaS provider charge setup or cancellation fees? At what point in the future will the total cost of the SaaS solution exceed what you would have paid for a premises-based license?

Finally, you should analyze the vendor's viability. SaaS vendors take on greater risk than traditional software vendors because their financial model accumulates revenue on a subscription basis rather than upfront. And since SaaS vendors must invest in more hardware and customer support resources, they are prone to suffer from a lack of capital. As testimony to the challenge of launching SaaS-based products, LucidEra, one of the first SaaS BI offerings, closed its doors in June 2009 because it couldn't secure another round of funding.


SaaS BI offers a lot of promise to reduce costs and speed deployment, but only for companies whose requirements suit cloud-based computing. Today, these are companies that have limited or no available IT resources, little capital to spend on building compute or software capabilities in-house, and BI applications that don't require significant, continuous transfers of data from source systems to the cloud.


Text Mining & Web Mining

Filed under: All topics,Text Mining & Web Mining — Lee Shi Huan @ 1:33 pm

Text mining, sometimes alternately referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the divining of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. ‘High quality’ in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
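As an illustrative sketch of the "structuring the input text" step described above, the fragment below reduces raw documents to term-frequency vectors — the simplest structured form from which patterns can be derived. This is a minimal, hypothetical pipeline; real systems add parsing and linguistic features.

```python
import re
from collections import Counter

def structure_text(doc: str) -> Counter:
    """Turn raw text into a term-frequency vector (a crude 'structuring' step)."""
    tokens = re.findall(r"[a-z]+", doc.lower())          # naive tokenization
    stopwords = {"the", "a", "of", "and", "to", "in", "is"}  # remove some features
    return Counter(t for t in tokens if t not in stopwords)

vec = structure_text("The mining of text is the mining of patterns in text.")
print(vec.most_common(2))   # 'mining' and 'text' dominate
```

Once every document is a vector like this, the categorization, clustering, and extraction tasks listed above all become operations on structured data.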

Text categorization

Document classification/categorization is a problem in information science. The task is to assign an electronic document to one or more categories, based on its contents. Document classification tasks can be divided into two sorts: supervised document classification where some external mechanism (such as human feedback) provides information on the correct classification for documents, and unsupervised document classification, where the classification must be done entirely without reference to external information. There is also a semi-supervised document classification, where parts of the documents are labeled by the external mechanism.
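A minimal sketch of the supervised case, where human-labeled examples provide the "external mechanism": each category accumulates term counts from its training documents, and a new document is assigned to the category whose vocabulary it shares most heavily. This is a toy illustration, not a production classifier.

```python
from collections import Counter, defaultdict

class TinyClassifier:
    """Minimal supervised document classifier: score by shared-term frequency."""
    def __init__(self):
        self.category_terms = defaultdict(Counter)

    def train(self, doc: str, category: str):
        # Human-provided label: the 'external mechanism' of supervised learning.
        self.category_terms[category].update(doc.lower().split())

    def classify(self, doc: str) -> str:
        words = doc.lower().split()
        return max(self.category_terms,
                   key=lambda c: sum(self.category_terms[c][w] for w in words))

clf = TinyClassifier()
clf.train("goal match score team", "sports")
clf.train("election vote senate bill", "politics")
print(clf.classify("the team won the match"))   # -> sports
```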

Text clustering

Document clustering (also referred to as text clustering) is closely related to the concept of data clustering. Document clustering is a more specific technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. For example, a web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is achieved by enterprise search engines such as Northern Light and Vivisimo.
Example: the official Web portal for the U.S. government uses document clustering to automatically organize its search results into categories. If a user submits "immigration", next to the list of results they will see categories for "Immigration Reform", "Citizenship and Immigration Services", "Employment", "Department of Homeland Security", and more.
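A crude single-pass clustering sketch in the same spirit: each retrieved document joins the first existing group it resembles by cosine similarity, otherwise it starts a new group. The threshold and documents are illustrative; real engines use far more sophisticated methods.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: join a doc to the first cluster it resembles."""
    clusters = []   # list of (centroid_counter, [member docs])
    for doc in docs:
        vec = Counter(doc.lower().split())
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)    # grow the centroid with the new doc
                members.append(doc)
                break
        else:
            clusters.append((vec, [doc]))
    return [members for _, members in clusters]

groups = cluster(["visa green card", "green card renewal", "border security policy"])
print(groups)   # the two 'green card' documents end up together
```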

Concept/entity extraction

Concept mining is an activity that results in the extraction of concepts from artifacts. Solutions to the task typically involve aspects of artificial intelligence and statistics, such as data mining and text mining. Because artifacts are typically a loosely structured sequence of words and other symbols (rather than concepts), the problem is nontrivial, but it can provide powerful insights into the meaning, provenance and similarity of documents.

Traditionally, the conversion of words to concepts has been performed using a thesaurus, and for computational techniques the tendency is to do the same. The thesauri used are either specially created for the task, or a pre-existing language model, usually related to Princeton’s WordNet.

The mappings of words to concepts are often ambiguous. Typically each word in a given language will relate to several possible concepts. Humans use context to disambiguate the various meanings of a given piece of text, where available. Machine translation systems cannot easily infer context.

For the purposes of concept mining however, these ambiguities tend to be less important than they are with machine translation, for in large documents the ambiguities tend to even out, much as is the case with text mining.

There are many techniques for disambiguation that may be used. Examples are linguistic analysis of the text and the use of word and concept association frequency information that may be inferred from large text corpora. Recently, techniques that base on semantic similarity between the possible concepts and the context have appeared and gained interest in the scientific community.
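The word-and-concept association-frequency idea can be sketched as follows, with a hypothetical corpus-derived table mapping candidate concepts to the context words they tend to co-occur with. The concept names and counts here are invented for illustration.

```python
# Hypothetical concept-association table: how often each candidate concept
# for the ambiguous word "bank" co-occurs with context words in some corpus.
ASSOCIATIONS = {
    "bank/finance":   {"loan": 40, "deposit": 30, "interest": 25},
    "bank/riverside": {"river": 35, "water": 20, "fishing": 15},
}

def disambiguate(context_words):
    """Pick the concept whose corpus associations best match the context."""
    return max(ASSOCIATIONS,
               key=lambda c: sum(ASSOCIATIONS[c].get(w, 0) for w in context_words))

print(disambiguate(["river", "fishing", "trip"]))   # -> bank/riverside
```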

One of the spin-offs of calculating document statistics in the concept domain, rather than the word domain, is that concepts form natural tree structures based on hypernymy and meronymy. These structures can be used to produce simple tree membership statistics that can be used to locate any document in a Euclidean concept space. If the size of a document is also considered as another dimension of this space, then an extremely efficient indexing system can be created. This technique is currently in commercial use locating similar legal documents in a 2.5-million-document corpus.

Standard numeric clustering techniques may be used in “concept space” as described above to locate and index documents by the inferred topic. These are numerically far more efficient than their text mining cousins, and tend to behave more intuitively, in that they map better to the similarity measures a human would generate.

Web mining is the application of data mining techniques to discover patterns from the Web. According to the analysis target, web mining can be divided into three types: Web usage mining, Web content mining, and Web structure mining.

Web usage mining is the process of finding out what users are looking for on the internet. Some users might be looking only at textual data, whereas others might want multimedia data. Web usage mining also helps find the search patterns of a particular group of people belonging to a particular region.

Web structure mining is the process of using graph theory to analyse the node and connection structure of a web site. According to the type of web structural data, web structure mining can be divided into two kinds.

The first kind of web structure mining is extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects a web page to a different location. The other kind of web structure mining is mining the document structure: using the tree-like structure to analyse and describe the HTML (Hyper Text Markup Language) or XML (eXtensible Markup Language) tags within the web page.
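As a small sketch of the first kind, Python's standard html.parser can walk a page's tag structure and collect the hyperlink targets that web structure mining works from (the sample HTML is invented):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hyperlink targets: the raw material for web structure mining."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="/about">about</a> and <a href="/contact">contact</a>.</p>')
print(parser.links)   # ['/about', '/contact']
```

Collected across many pages, these links form the graph that structure-mining algorithms analyse with graph theory.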

Web mining Pros and Cons


Web mining has many advantages that make the technology attractive to corporations, including government agencies. It has enabled e-commerce sites to do personalized marketing, which eventually results in higher trade volumes. Government agencies use the technology to classify threats and fight terrorism, and the predictive capability of mining applications can benefit society by identifying criminal activity. Companies can establish better customer relationships by giving customers exactly what they need; they can understand customer needs better and react to those needs faster. They can find, attract, and retain customers; they can save on production costs by utilizing the acquired insight into customer requirements; and they can increase profitability through target pricing based on the profiles created. They can even identify a customer who might defect to a competitor and try to retain that customer with promotional offers, thus reducing the risk of losing the customer.


Web mining itself doesn't create issues, but the technology can cause concerns when used on data of a personal nature. The most criticized ethical issue involving web mining is the invasion of privacy. Privacy is considered lost when information concerning an individual is obtained, used, or disseminated, especially if this occurs without their knowledge or consent. The obtained data is analyzed and clustered to form profiles; the data is made anonymous before clustering so that no individual can be linked directly to a profile. But group profiles are often used as if they were personal profiles. These applications thus de-individualize users by judging them by their mouse clicks. De-individualization can be defined as a tendency to judge and treat people on the basis of group characteristics instead of on their own individual characteristics and merits.
Another important concern is that the companies collecting the data for a specific purpose might use the data for a totally different purpose, and this essentially violates the user's interests. The growing trend of selling personal data as a commodity encourages website owners to trade personal data obtained from their sites. This trend has increased the amount of data being captured and traded, increasing the likelihood of one's privacy being invaded. The companies that buy the data are obliged to make it anonymous, and these companies are considered the authors of any specific release of mining patterns. They are legally responsible for the contents of the release, and any inaccuracies in the release can result in serious lawsuits, but there is no law preventing them from trading the data.
Some mining algorithms might use controversial attributes such as sex, race, religion, or sexual orientation to categorize individuals. These practices might violate anti-discrimination legislation. The applications make it hard to identify the use of such controversial attributes, and there is no strong rule against the usage of such algorithms with such attributes. This process could result in the denial of a service or privilege to an individual based on race, religion, or sexual orientation; right now, this situation can be avoided only by the high ethical standards maintained by the data mining company. The collected data is made anonymous so that the obtained data and patterns cannot be traced back to an individual. It might look as if this poses no threat to one's privacy, but in reality much additional information can be inferred by combining separate pieces of data about the user.


Data Warehouse & OLAP

Filed under: All topics,Data Warehouse & OLAP — Lee Shi Huan @ 1:21 pm

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.
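A toy illustration of the ETL step described above, using SQLite in-memory databases (the table and column names are hypothetical): rows are extracted from an operational store, transformed, and loaded into a warehouse fact table.

```python
import sqlite3

# Toy ETL sketch (illustration only, hypothetical tables): copy order rows
# from an operational store into a warehouse fact table, converting cents
# to dollars along the way.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount_cents INTEGER)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 1999), (2, 350)])

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_sales (order_id INTEGER, amount_dollars REAL)")

rows = src.execute("SELECT id, amount_cents FROM orders").fetchall()   # extract
dw.executemany("INSERT INTO fact_sales VALUES (?, ?)",
               [(oid, cents / 100.0) for oid, cents in rows])          # transform + load

total = dw.execute("SELECT SUM(amount_dollars) FROM fact_sales").fetchone()[0]
print(round(total, 2))   # 23.49
```

Real ETL tools add scheduling, error handling, and bulk-load paths, but the extract/transform/load shape is the same.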

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like “Who was our best customer for this item last year?” This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.


Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.


Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

Contrasting OLTP and Data Warehousing Environments

Figure 1-1 illustrates key differences between an OLTP system and a data warehouse.

Figure 1-1 Contrasting OLTP and Data Warehousing Environments


One major difference between the types of system is that data warehouses are not usually in third normal form (3NF), a type of data normalization common in OLTP environments.

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

  • Workload: Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.

    OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

  • Data modifications: A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.

    In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

  • Schema design: Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.

    OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

  • Typical operations: A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."

    A typical OLTP operation accesses only a handful of records. For example, “Retrieve the current order for this customer.”

  • Historical data: Data warehouses usually store many months or years of data. This is to support historical analysis.

    OLTP systems usually store data from only a few weeks or months. The OLTP system stores only historical data as needed to successfully meet the requirements of the current transaction.
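The contrasts above can be made concrete with a tiny star-schema sketch in SQLite (the fact and dimension tables are hypothetical): the typical warehouse query scans the fact table, joins a denormalized dimension, and aggregates.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, sale_month TEXT, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales VALUES (1, '2010-01', 500.0), (1, '2010-01', 250.0),
                                  (2, '2010-01', 100.0);
""")

# The classic warehouse query: scan the fact table, join a dimension,
# and aggregate ("find total sales for all customers last month").
for name, total in db.execute("""
        SELECT c.name, SUM(f.amount)
        FROM fact_sales f JOIN dim_customer c USING (customer_key)
        WHERE f.sale_month = '2010-01'
        GROUP BY c.name ORDER BY c.name"""):
    print(name, total)   # Acme 750.0 / Globex 100.0
```

An OLTP operation, by contrast, would touch a handful of rows by key ("retrieve the current order for this customer") and issue individual inserts and updates.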

OLAP Tool Functionalities

Before we speak about OLAP tool selection criteria, we must first distinguish between the two types of OLAP tools: MOLAP (Multidimensional OLAP) and ROLAP (Relational OLAP).

  1. MOLAP: In this type of OLAP, a cube is aggregated from the relational data source (data warehouse). When a user generates a report request, the MOLAP tool can generate the report quickly because all data is already pre-aggregated within the cube.
  2. ROLAP: In this type of OLAP, instead of pre-aggregating everything into a cube, the ROLAP engine essentially acts as a smart SQL generator. The ROLAP tool typically comes with a 'Designer' piece, where the data warehouse administrator can specify the relationships between the relational tables, as well as how dimensions, attributes, and hierarchies map to the underlying database tables.

Right now, there is a convergence between the traditional ROLAP and MOLAP vendors. ROLAP vendors recognize that users want their reports fast, so they are implementing MOLAP functionalities in their tools; MOLAP vendors recognize that it is often necessary to drill down to the most detailed information — levels that traditional cubes do not reach for performance and size reasons.

So what are the criteria for evaluating OLAP vendors? Here they are:

  • Ability to leverage parallelism supplied by RDBMS and hardware: This would greatly increase the tool’s performance, and help loading the data into the cubes as quickly as possible.  
  • Performance: In addition to leveraging parallelism, the tool itself should be quick both in terms of loading the data into the cube and reading the data from the cube.  
  • Customization efforts: More and more, OLAP tools are used as advanced reporting tools, especially in ROLAP implementations. In such cases, the ease of front-end customization becomes an important factor in the tool selection process.
  • Security Features: Because OLAP tools are geared towards a number of users, making sure people see only what they are supposed to see is important. By and large, all established OLAP tools have a security layer that can interact with the common corporate login protocols. There are, however, cases where large corporations have developed their own user authentication mechanism and have a “single sign-on” policy. For these cases, having a seamless integration between the tool and the in-house authentication can require some work. I would recommend that you have the tool vendor team come in and make sure that the two are compatible.  
  • Metadata support: Because OLAP tools aggregate the data into the cube and sometimes serve as the front-end tool, it is essential that they work with the metadata strategy/tool you have selected.

    Popular Tools
    • Business Objects
    • Cognos
    • Hyperion
    • Microsoft Analysis Services
    • MicroStrategy
    • Pentaho
    • Palo OLAP Server
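The ROLAP "smart SQL generator" idea mentioned earlier can be sketched like this, with a hypothetical designer-style mapping of dimensions and measures to warehouse columns from which the engine emits SQL on demand (all table and column names are invented):

```python
# Hypothetical designer metadata: how dimensions and measures map to the
# underlying warehouse tables.
MAPPING = {
    "dimensions": {"region": "dim_store.region", "month": "dim_date.month"},
    "measures":   {"sales": "SUM(fact_sales.amount)"},
    "joins": ("fact_sales JOIN dim_store USING (store_key) "
              "JOIN dim_date USING (date_key)"),
}

def generate_sql(dims, measures):
    """Emit an aggregate query for the requested dimensions and measures."""
    dim_cols = [MAPPING["dimensions"][d] for d in dims]
    msr_cols = [MAPPING["measures"][m] for m in measures]
    return (f"SELECT {', '.join(dim_cols + msr_cols)} "
            f"FROM {MAPPING['joins']} "
            f"GROUP BY {', '.join(dim_cols)}")

print(generate_sql(["region"], ["sales"]))
```

A real ROLAP engine layers on hierarchies, filters, and optimizer hints, but the core move — metadata in, SQL out — is the same.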



    Top Ten Reasons to Do OLAP

    Business Intelligence

    Information Management Magazine, June 1998 

    Susan Osterfelt 

    #10. To see what all the fuss is about. OLAP is hot. The press has been having a field day reporting on OLAP vendor product announcements and user testimonials. You cannot open a technology magazine without seeing articles about enhancements to OLAP products or the value a business has derived from doing multidimensional analysis. 

    #9. To answer the complaint, “I spend too much of my time gathering data and not enough time analyzing it.” The effort to find and filter data far outweighs the effort spent in its interpretation. If data were presented in a way that facilitates analysis, rather than having to do a separate query for each question that arises, analysts could do what they were hired to do–analyze information and make decisions on business direction. OLAP was designed precisely to facilitate analysis. 

    #8. OLAP complements the use of other business intelligence tools. Data access tools provide the ability to ask an ad hoc question and get an immediate result. Enterprise reporting tools allow for the delivery of information to the masses. Decision support and executive information system tools present a predefined “dashboard” approach to data analysis. Data mining, despite what some people think, is not synonymous with OLAP. Data mining involves the use of statistical techniques to determine patterns in data. OLAP is a way of presenting relational data to users to facilitate understanding the data and those important patterns that lie within it. OLAP is not specifically a tool for data mining, but rather a tool for presenting and drilling down into information to determine data interrelationships. It is truly on-line analytical processing. OLAP should be a key component in any organization’s complete business intelligence tool suite. 

    #7. OLAP technology is mature. Well, relatively mature. OLAP has been around long enough that its mark in terms of technology has been made. OLAP does something no other product set does and is no longer bleeding edge in its implementation. This should ease any “early adopter” fears. 

    #6. To stake a claim in the MOLAP vs. ROLAP debate (or possibly implement each where it is appropriate). Multidimensional database management systems (MOLAP or MDBMS) use proprietary data stores. Data must be copied or moved into these data stores, and pre-aggregation provides superior response time. However, MOLAP has limited scalability. Relational OLAP (ROLAP or RDBMS), provides multidimensional analysis against data that remains in a relational database management system. It requires building and maintaining a star schema data structure, but it can support very large databases. In reality, the debate is quieting somewhat as organizations realize that one approach is not better than the other. Rather, the key is to use each type of OLAP where appropriate. 

    #5. To achieve speed or response time performance gains. If performance is a major requirement, OLAP (read, MOLAP) may be the only way to reach your goal. By pre-aggregating data, OLAP can provide superior response time. 

    #4. Results of OLAP analyses can be presented visually, which enables improved comprehension. Presenting multidimensional data graphically in addition to spatially improves its comprehension. 

    #3. OLAP is implementable on the Web. Many, if not most, OLAP vendors have announced or are working on Web-enabled versions of their products. This is terrific in terms of delivering OLAP to the business masses, since users can perform OLAP in their browser instead of having to install, learn and use a specific client OLAP tool. For example, SQRIBE Technologies recently announced a 100 percent Java server OLAP product, called PowerSQRIBE, which downloads Java applets to clients for performing analysis and accesses databases via a JDBC (Java Database Connectivity) driver at the server level. While client-heavy OLAP tools can have more functionality, better graphical interfaces and faster response times, Web-enabled OLAP tools can be quickly deployed to thousands of distributed workers at a low cost with little training. 

    #2. OLAP facilitates the business decision-making process. The OLAP concept is the iterative process most business analysts use, where asking and answering one question generates three or four additional questions. The analyst’s desire, when looking at a two-dimensional spreadsheet information format, to make rows into columns and columns into rows is easily performed with a rotate function. OLAP provides simple navigation of information to quickly drill down into most business questions. 
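The rotate function the analyst reaches for (rows become columns and columns become rows) can be sketched in a few lines of Python; the quarterly figures below are hypothetical:

```python
# A small two-dimensional grid: quarters as rows, regions as columns.
header = ["Quarter", "East", "West"]
rows = [
    ["Q1", 120, 95],
    ["Q2", 140, 110],
]

# The OLAP "rotate" (pivot): transpose the whole table,
# so each original column becomes a row.
def rotate(header, rows):
    table = [header] + rows
    return [list(col) for col in zip(*table)]

for line in rotate(header, rows):
    print(line)
# First rotated row is ['Quarter', 'Q1', 'Q2']
```

Production OLAP tools do this against cubes of many dimensions, but the underlying operation is this same transposition.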

    And the number one reason to do OLAP: Your competition is already doing it. OLAP provides competitive advantage by providing immediate business value. And you can be assured that your competitors have been analyzing and responding to business information trends by using OLAP tools. Don’t you want to shorten the cycle time on making critical business decisions? Implement OLAP.


    Future of Data Warehouse, Data Mining & Data Visualisation

    A focus on the return to basics – drive revenue, cut costs, and acquire and retain customers.

    Other trends indicate that the future will bring a level of complexity and business importance that will raise the bar for all of us. The real-time implementation of a business action, decision or change of direction that is based on the results of strategic data analysis is now the reality. The data issues surrounding this trend aren’t getting any easier or smaller. Combine the need for real-time data warehousing and increased data size and complexity, and we set the stage for a new type of warehouse – the “virtual” enterprise data warehouse. This virtual DW or private hub for both operational and informational needs will begin to drive new demands on the ability of organizations to assimilate vast data assets stored in merged/acquired companies or divisional enterprise resource planning (ERP) and legacy environments. The time and/or dollars needed to integrate all the operational systems will make the traditional method of data integration impractical. This intersection of Web channels and data warehousing has the potential to become the standard architecture for large, complex organizations.

    Next, it is without a doubt that the entire customer relationship management (CRM) explosion is driving a large portion of the current data warehouse projects in the industry. But is this really new? I’ve always stated that those of us who are veterans in the DW world were building and running CRM environments before there was an acronym for CRM. Furthermore, we are definitely seeing a reverse evolution in the CRM space, which accentuates the importance of analyzing and measuring the effectiveness and efficiency of operational CRM capabilities. Continuing on the theme of measurement, remember the executive information system (EIS)? The EIS was the easy-to-use, show-me-the-numbers application directed at the senior business management of a company. The intent of an EIS was to make the key performance indicators essential to running a business available at the touch of a button. These applications were the rage in the mid-’80s. The EIS lives again as the digital dashboard. All the leading online analytical processing (OLAP) software vendors have developed or are in the process of developing key performance indicator modules; and many dashboard-specific vendors are starting up and/or growing rapidly.

    Lastly, the complete scope of the organization, structure and processes needed for a successful customer deployment of a large, complex data warehouse continues to be a major problem for most organizations. Making it happen successfully takes a unique focus and team of people. Mainly, the ability of an organization to change or dispose of the conventional wisdom regarding the correct mix and role of staff is important. Regarding this topic, I have the following observations. Your customer deployment team structure must mimic your desired business goals; and the team should reside within the same space – business operations, business analysts and IT professionals need to learn to cohabitate. Training must go beyond basic tool training to include pure business solution analysis tied to the desired outcome. Tight tracking and measurement of the expected business return has to be easily understood and continually revisited against expectations. Include in your engineering of the customer deployment shop not just what the team members will do, but how they will do it. What process can be made repeatable, scalable and flexible enough to change with the business?

    The complexity of data mining must be hidden from end users before it can take true center stage in an organization. Business use cases can be designed, with tight constraints, around data mining algorithms.


    Information Dashboard Design December 2, 2009

    Filed under: All topics,Information Dashboard Design — Lee Shi Huan @ 8:38 am

    Designing Effective Metrics Management Dashboards

    Designers of metrics management dashboards need to incorporate three areas of knowledge and expertise when building dashboards. They must understand the dashboard users’ needs and expectations, both for the metrics and for the presentation of those metrics; they must understand where and how to get the data for those metrics; and they must apply uniform standards to the design of dashboards and dashboard suites in order to make them intuitive for end users.

    Use Color Judiciously

    Users expect color to provide important information they need, not distract them.

    Generally, color can be used for four effects on a dashboard. It can:

    • Identify the status of key metrics and areas that require attention. For example, use red to identify expenditures that have increased more than 15 percent over the same period the previous year.
    • Identify types of information. Color can be used to help users instantly identify the type of information they are looking at. For example, dark green can be used for monetary values, and dark blue for quantities of items.

    • De-emphasize areas or items. Border areas, backgrounds and other supporting dashboard components (the dashboard skins) should use plain, unobtrusive colors that help define dashboard areas without distracting from the information displayed.
    • Identify the dashboard type or its level. Different background colors, or the color of dashboard titles, can help users identify what they are looking at. For example, financial dashboards could use a green skin, while help desk dashboards could use a reassuring beige.
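As a sketch of the first effect, a status function can map a metric to a color; the 15 percent red threshold comes from the expenditure example above, while the amber band and the exact function shape are added assumptions:

```python
# Hypothetical status rule: flag year-over-year expenditure growth
# above 15% in red (per the article's example).
def status_color(current, prior_year):
    """Return a display color for a year-over-year expenditure metric."""
    growth = (current - prior_year) / prior_year
    if growth > 0.15:
        return "red"    # requires attention
    if growth > 0.05:
        return "amber"  # assumed warning band, not from the article
    return "green"

print(status_color(120, 100))  # 20% growth -> 'red'
print(status_color(102, 100))  # 2% growth  -> 'green'
```

Centralizing the thresholds in one function is what keeps color usage uniform across a dashboard suite.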

    Usability Checklist

    Attention given to dashboard design can pay enormous dividends, both in user satisfaction with the dashboards and, especially, in improved business performance founded in pro-active metrics management and reasoned, informed decision-making processes.

    Dashboard users want their questions about the metrics they are viewing on a dashboard answered even before they can formulate the questions. In fact, by the time a question about business performance is asked, it is often too late to take corrective action.

    Valuable dashboards provide up-to-date information right at the user’s fingertips so that problem areas can be addressed as they arise, and opportunities can be taken advantage of immediately.

    Ensuring that a dashboard and its components answer the questions in the table below should help dashboard designers create more effective dashboards.

    User Questions & Design Solutions

    What am I looking at?

    Use clear, descriptive titles and labels.

    Does this mean ‘good’ or ‘bad’?

    Use standard, culturally accepted colors and symbols.

    Are things getting better or worse?

    Employ thresholds, and show meaningful comparisons and trends.

    What is being measured, and what are the units of measure?

    Clearly identify the units of measure, and provide actual values.

    What is the target or norm?

    Clearly show targets and norms, and design displays that show progress towards these.

    How recent is the data?

    Provide a date and time stamp for each metric.

    How can I get more details?

    Provide drill-down links to groups with detailed information.

    How can I get a broader view?

    Provide links to roll-up and overview dashboards.

    What do I do with this information: what action should I take?

    Always place data in context, and where possible suggest advisable actions based on the metric.

    When should I check for an update?

    Provide the date and time when the metric will be updated. When business needs warrant, allow ad hoc updates.

    How do I get metrics that are not on these dashboards?

    Be ready to develop new dashboards. Users will want them!


    It is important for a dashboard to be designed in such a way that it conveys its message to users effectively. Only then can users analyse the data better and come up with better solutions (or immediate ones for pressing issues) to help improve the organisation, and only then is the purpose of the dashboard met. Dashboard design must always take the user’s point of view in order to benefit them.


    Data Warehouse November 29, 2009

    Filed under: All topics,Data Warehouse — Lee Shi Huan @ 5:38 pm

    What Data Errors You May Find When Building a Data Warehouse?

    • Incomplete
    • Incorrect
    • Incomprehensible
    • Inconsistent

    Incomplete errors 

    These consist of:

    – Missing records – This means a record that should be in a source system is not there. Usually this is caused by a programmer who diddled with a file and did not clean up completely. (I read a white paper about how users have to “fess up” about bad data. Actually, usually system personnel cause MUCH more headaches than users.) Note you may not spot this type of error unless you have another system or old reports to tie to.
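Tying the extract back to another system or an old report can be as simple as a set difference over record keys; the `ORD-` keys below are invented for illustration:

```python
# Keys listed in an old report vs. keys found in the source extract.
report_keys = {"ORD-1001", "ORD-1002", "ORD-1003", "ORD-1004"}
extract_keys = {"ORD-1001", "ORD-1003", "ORD-1004"}

# Records the reference lists but the extract is missing.
missing = report_keys - extract_keys
print(sorted(missing))  # ['ORD-1002']
```

Without a second source to tie to, there is nothing to difference against, which is exactly why this error class can go unnoticed.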

    – Missing fields – These are fields that should be there but are not. There is often a mistaken belief that a source system requires entry of a field.

    – Records or fields that, by design, are not being recorded – That is, by intelligent or careless design, data you want to store in the data warehouse are not being recorded anywhere. I further divide this situation into three categories. First, there may be dimension table attributes you will want to record but which are not in any system feeding the data warehouse. For example, the marketing user may have a personal classification scheme for products indicating the degree to which items are being promoted. Second, if you are feeding the same type of data in from multiple systems you may find that one of the source systems does not record a field your user wants to store in the data warehouse. Third, there may be “transactions” you need to store in the data warehouse that are not recorded in an explicit manner. For example, updating the source system may not necessarily cause the recording of a transaction. Or, sometimes adjustments to source system data are made downstream from the source system. Off–invoice adjustments made in general ledger systems are a big offender. In this case you may find that the grain of the information to be stored in the warehouse may be lost in the downstream system.

    Incorrect errors

    You can say that again! That is, the data really are incorrect.

    – Wrong (but sometimes right) codes – This usually occurs when an old transaction processing system is assigning a code that the transaction processing system users do not care about. Now if the code is not valid, you are going to catch it. The “gotcha” comes when the code is wrong but it is still a valid code. For example, you may have to extract data from an ancient repair parts ordering system that was programmed in 1968 to assign a product code of 100 to all transactions. Now, however, product code 100 stands for something other than repair parts.

    – Wrong calculations, aggregations – This situation refers to when you decide to or have to load data that have already been calculated or aggregated outside the data warehouse environment. You will have to make a judgment call on whether to check the data. You may find it necessary to bring data into the warehouse environment solely to allow you to check the calculation.

    – Duplicate records – There usually are two situations to be dealt with. First, there are duplicate records within one system whose data are feeding the warehouse. Second, there is information that is duplicated in multiple systems that feed in the same type of information. For example, maybe you are feeding in data from an order entry system for products and an order entry system for services. Unbeknownst to you, your branch in West Wauwatosa is booking services in both the product and service order entry systems. (The possibility of a situation like this may sound crazy until you encounter the quirks in real world systems.) In both cases, note that you may miss the duplicates if you feed already aggregated data into the warehouse.
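A duplicate check across two feeds can be a simple count over the combined keys; the order numbers below are invented, standing in for the product and service order entry feeds:

```python
from collections import Counter

# Hypothetical order keys from two feeds; one branch booked the same
# service order ("S-9") in both systems.
product_orders = ["P-1", "P-2", "S-9"]
service_orders = ["S-8", "S-9"]

# Any key appearing more than once across the combined feeds is suspect.
counts = Counter(product_orders + service_orders)
duplicates = [key for key, n in counts.items() if n > 1]
print(duplicates)  # ['S-9']
```

Note this only works against detail rows: as the article says, feeding already aggregated data into the warehouse hides the duplicates.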

    – Wrong information entered into source system – Sometimes a source system contains data that were simply incorrectly entered into the system. For instance, someone may have keypunched 6/9/96 as 9/6/96. Now the obvious action is to correct the source system. However, sometimes, for various reasons, the source system cannot be corrected. Note that if you have many errors in a source system that cannot be corrected, you have a much larger issue in that you do not really have a reliable “system of record.”

    – Incorrect pairing of codes – This is best described by an example. Sometimes there is supposed to be a rule stating that if a part number suffix is XXX, then the category code should be either A, B, or C. In more technical terms, there is a non-arithmetic relationship between attributes whose rules have been broken.
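Such a pairing rule can be checked mechanically during the load; this sketch hard-codes the XXX/A-B-C rule from the example, with invented part numbers:

```python
# Hypothetical rule from the example: suffix 'XXX' implies category A, B, or C.
ALLOWED = {"XXX": {"A", "B", "C"}}

def pairing_violations(records):
    """Yield (part, suffix, category) rows that break a known pairing rule."""
    for part, suffix, category in records:
        allowed = ALLOWED.get(suffix)
        if allowed is not None and category not in allowed:
            yield (part, suffix, category)

records = [("P1", "XXX", "A"), ("P2", "XXX", "Z"), ("P3", "YYY", "Z")]
print(list(pairing_violations(records)))  # [('P2', 'XXX', 'Z')]
```

Suffixes without a registered rule (like "YYY" above) pass through unchecked, which is usually the safer default for rules nobody has documented yet.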

    Incomprehensibility errors

    These are the types of conditions that make source data difficult to read.

    – Multiple fields within one field – This is the situation where a source system has one field which contains information that the data warehouse will carry in multiple fields. By far the most common occurrence of this problem is when a whole name, e.g., “Joe E. Brown”, is kept in one field in the source system and it is necessary to parse this into three fields in the warehouse.
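A naive sketch of that parsing step follows; real name parsing is considerably messier (suffixes, multi-part surnames, titles), so this handles only the simplest patterns:

```python
def split_name(full_name):
    """Naively split 'First Middle Last' into three warehouse fields.

    Only handles the simple two- and three-token cases; anything else
    is passed through untouched in the first field."""
    parts = full_name.split()
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    if len(parts) == 2:
        return parts[0], "", parts[1]
    return full_name, "", ""

print(split_name("Joe E. Brown"))  # ('Joe', 'E.', 'Brown')
```

The pass-through fallback matters: rows the parser cannot handle should be surfaced for manual review, not silently mangled.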

    – Weird formatting to conserve disk space – This occurs when the programmer of the source system resorted to some out of the ordinary scheme to save disk space. In addition to singular fields being formatted strangely, the programmer may also have instituted a record layout that varies.

    – Unknown codes – Many times you can figure out what 99% of the codes mean. However, you usually find that there will be a handful of records with unknown codes, and usually these records contain huge or minuscule dollar amounts and are several years old. 

    – Spreadsheets and word processing files – Often in order to perform the initial load of a data warehouse it is necessary to extract critical data being held in spreadsheet files and/or “merge list” files. However, often anything goes in these files. They may contain a semblance of a structure with data that are half validated.

    – Many-to-many relationships and hierarchical files that allow multiple parents – Watch out for this architecture in source systems. It is easy to incorrectly transfer data organized in such a manner.

    Inconsistency errors

    The category of inconsistency errors encompasses the widest range of problems. Obviously similar data from different systems can easily be inconsistent. However, data within one system can be inconsistent across locations, reporting units, and time.

    – Inconsistent use of different codes – Much of the data warehousing literature gives the example of one system that uses “M” and “F” and another system that uses “1” and “2” to distinguish gender. May I suggest that you will wish this were the toughest data cleaning problem you face.
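Conforming such codes is usually a lookup-table job. Note the direction of the numeric mapping below is an assumption for illustration; in practice you confirm it against the source system's documentation:

```python
# Map each source system's codes onto one warehouse standard.
# Assumption (not from the article): "1" means male, "2" means female.
GENDER_MAP = {"M": "M", "F": "F", "1": "M", "2": "F"}

def conform_gender(raw):
    """Normalize a raw code; unknown values become 'U' for review."""
    return GENDER_MAP.get(str(raw).strip().upper(), "U")

print([conform_gender(v) for v in ["M", "2", 1, " f ", "X"]])
# ['M', 'F', 'M', 'F', 'U']
```

Routing unmapped values to an explicit 'U' bucket, rather than dropping them, is what makes the cleanup auditable.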

    – Inconsistent meaning of a code – This is usually an issue when the definition of an organizational entity changes over time. For example, say in 1995 you have customers A, B, C, and D. In 1996, customer A buys customer B. In 1997, customer A buys customer C. In 1998, customer A sells off part of what was A and C to customer D. When you build your warehouse in 1999, based on the type of business analysis you perform, you may face the dilemma of how to identify the sales to customers A, B, C, and D in previous years.

    – Overlapping codes – This is a situation where one source system records, say, all its sales to Customer A with three customer numbers and another source system records its sales to Customer A with two different customer numbers. Now, the obvious solution is to use one customer number here. The problem is that there is usually some good business reason why there are five customer numbers.

    – Different codes with the same meaning – For example, some records may indicate a color of violet and some may indicate a color of purple. The data warehouse users may want to see these as one color. More annoyingly, sometimes spaces and other extraneous information have been inconsistently embedded in codes.

    – Inconsistent names and addresses – Strictly speaking this is a case of different codes with the same meaning. My unscientific impression of this type of problem is that decent knowledge of string searching will allow you to relatively easily make name and address information 80% consistent. Going for 90% consistency requires a huge jump in the level of effort. Going for 95% consistency requires another incremental huge jump in effort. As for 100% consistency in a database of substantial size, you may want to decide if sending a person to Mars is easier.
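A rough first pass at that string searching can lean on the standard library; the 0.85 threshold and the company names below are arbitrary illustrations, and real matching pipelines layer normalization and blocking on top of this:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Rough string-similarity test for flagging candidate duplicate names."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold

print(similar("Acme Corp.", "ACME Corp"))       # close variants match
print(similar("Acme Corp.", "Apex Industries")) # unrelated names do not
```

This sort of tool gets you toward the 80% plateau the article describes; the last percentage points are where the effort curve turns vertical.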

    – Inconsistent business rules – This, for the most part, is a fancy way of saying that calculated numbers are calculated differently. Normally, you will probably avoid loading calculated numbers into the warehouse, but there sometimes is the situation where this must be done. As noted before, you may have to feed data into the warehouse solely to check calculations. This can also mean that a non–arithmetic relationship between two fields (e.g., if a part number suffix is XXX, then the category code should be either A, B, or C) is not consistently followed.

    – Inconsistent aggregating – Strictly speaking this is a case of inconsistent business rules. In a nutshell, this refers to when you need to compare multiple sets of aggregated data and the data are aggregated differently in the source systems. I believe the most common instance of this type of problem is where data are aggregated by customer.

    – Inconsistent grain of the most atomic information – Certain times you need to compare multiple sets of information that are not available at the same grain. For example, customer and product profitability systems compare sales and expenses by product and customer. Often sales are recorded by product and customer but expenses are recorded by account and profit center. The problem occurs when there is not necessarily a relation between the customer or product grain of the sales data and the account – profit center grain of the expense data.

    – Inconsistent timing – Strictly speaking this is a case of inconsistent grain of the most atomic information. This problem especially comes into play when you buy data. For example, if you work for a pickle company you might want to analyze purchased scanner data for grocery store sales of gherkins. Perhaps you purchase weekly numbers. When someone comes up with the idea to produce a monthly report that incorporates monthly expense data from internal systems, you’ll find that you are, well, in a pickle.

    – Inconsistent use of an attribute – For example, an order entry system may have a field labeled shipping instructions. You may find that this field contains the name of the customer purchasing agent, the e–mail address of the customer, etc. A more difficult situation is when different business policies are used to populate a field. For example, perhaps you have a fact table with ledger account numbers. You may find that entity A uses account ‘1000’ for administrative expenses while entity B uses ‘1500’ for administrative expenses. (This problem gets more interesting if entity A uses ‘1500’ and entity B uses ‘1000’ for something other than administrative expenses.)

    – Inconsistent date cut–offs – Strictly speaking this is a case of inconsistent use of an attribute. This is when you are merging data from two systems that follow different policies as to dating transactions. As you can imagine, the issue comes up most with dating sales and sales returns.

    – Inconsistent use of nulls, spaces, empty values, etc. – Now this is not the hardest problem to correct in a warehouse. It is easy, though, to forget about this until it is discovered at the worst possible time.

    – Lack of referential integrity – It is surprising how many source systems have been built without this basic check.
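When the source system never enforced it, the warehouse load has to: a basic referential-integrity check against a dimension's key set is a one-line filter (the keys and amounts below are invented):

```python
# Dimension keys and fact rows that should reference them.
customer_keys = {"C1", "C2", "C3"}
fact_rows = [("C1", 100.0), ("C4", 55.0), ("C2", 20.0)]

# Orphan facts: rows whose customer key has no matching dimension row.
orphans = [row for row in fact_rows if row[0] not in customer_keys]
print(orphans)  # [('C4', 55.0)]
```

Orphans found this way are typically shunted to a suspense table rather than loaded, so the fact tables stay joinable.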

    – Out of synch fact data – Certain summary information may be derived independently from data in different fact tables. For example, a total sales number may be derived from adding up either transactions in a ledger debit/credit fact table or transactions in a sales invoice fact table. Obviously there may be differences because one table is updated later than another table. Often, however, the differences are symptoms of deeper problems.

    Aspects of Data Warehouse Architecture

    Data consistency architecture

    This is the choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses to put into common usage. It is also the equally important choice of what data sources, dimensions, business rules, semantics, and metrics an organization chooses not to put into common usage. This is by far the hardest aspect of architecture to implement and maintain because it involves organizational politics. However, determining this architecture has more to do with determining the place of the data warehouse in your business than any other architectural decision. In my opinion, the decisions involved in determining this architecture should drive all other architectural decisions. Unfortunately, this architecture seems more often to be backed into than consciously made.

    Reporting data store and staging data store architecture

    The main reasons we store data in a data warehousing system are so they can be: 1) reported against, 2) cleaned up, and (sometimes) 3) transported to another data store where they can be reported against and/or cleaned up. Determining where we hold data to report against is what I call the reporting data store architecture. All other decisions are what I call staging data store architecture. As mentioned before, there are infinite variations of this architecture. Many writings on this aspect of architecture take on a religious overtone. That is, rather than discussing what will make most sense for the organization implementing the data warehouse, the discussion is often one of architectural purity and beauty or of the writer’s conception of rightness and wrongness.

    Data modeling architecture

    This is the choice of whether you wish to use denormalized, normalized, object–oriented, proprietary multidimensional, etc. data models. As you may guess, it makes perfect sense for an organization to use a variety of models.

    Tool architecture

    This is your choice of the tools you are going to use for reporting and for what I call infrastructure.

    Processing tiers architecture

    This is your choice of what physical platforms will do what pieces of the concurrent processing that takes place when using a data warehouse. This can range from an architecture as simple as host-based reporting to one as complicated as the diagram on page 32 of Ralph Kimball’s “The Data Webhouse Toolkit”.

    Security architecture

    If you need to restrict access down to the row or field level, you will probably have to use some other means to accomplish this other than the usual security mechanisms at your organization. Note that while security may not be technically difficult to implement, it can cause political consternation.


    It is important that a data warehouse provide concise and accurate data, since this directly affects the accuracy of upper management’s analysis and thus the organisation. The architecture is also important, as it affects the system implementation. Moreover, if data are inconsistent, the problems that surface in the long term will be even more challenging to tackle, and this will hinder the organisation’s direction of growth.