Hadoop vs. Data Warehouse

by ExistBI

The majority of Hadoop experts believe an integrated data warehouse (IDW) is simply a huge pile of data. However, data volume has nothing to do with what makes a data warehouse. An IDW is a design pattern, an architecture for an analytics environment. First defined by Barry Devlin in 1988, the architecture was quickly called into question as implementers built huge databases with simple designs as well as small databases with complex designs.

In 1992, Bill Inmon published “Building the Data Warehouse,” which described two competing implementations: data warehouses and data marts. Gartner echoed Inmon’s position in 2005 in its research “Of Data Warehouses, Operational Data Stores, Data Marts and Data Outhouses.” Both are summarized, in simplified form, in the following table.

| Integrated Data Warehouses | Data Marts |
| --- | --- |
| Subject oriented | Subject oriented |
| Integrated | Denormalized |
| Nonvolatile | Nonvolatile |
| Time variant | Time variance and currency |
| Persistent | Virtualization option |

“Subject oriented” means the IDW is a digital reflection of the business. Subject areas contain tabular data about customers, inventory, financials, sales, suppliers, accounts, etc. The IDW contains many subject areas, each of which comprises 250 to 5,000 relational tables. Having many subject areas enables cross-organizational analysis – often called the 360-degree view. The IDW can answer thousands of routine, ad hoc, and complex questions.

In contrast, a data mart deploys a small fraction of one or two subject areas (i.e., a few tables). With only a few tables, data marts answer far fewer questions and are poor at handling ad hoc requests from executives.

Integration in a data warehouse has many aspects. First is the standardization of data types. This means account balances contain only valid numbers, date fields have only valid dates, and so on. Integration also means rationalizing data from multiple operational applications. For example, say four corporate applications have Bill Franks, William Franks, W. J. Franks, and Frank Williams all at the same street address. Data-integration tools figure out which is the best data to put in the IDW. Data cleansing corrects messed-up data. For example, repairs are needed when “123 Oak St., Atlanta” is in the street address but the city field is blank. Data integration performs dozens of tasks to improve the quality and validity of the data. Coupled with subject areas, this is called “a single version of the truth.”
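
To make these integration tasks concrete, below is a minimal sketch in Python – with purely hypothetical field names and rules, not any particular vendor’s API – of the kind of type standardization and cleansing a data-integration tool applies before a record lands in the IDW. Real integration suites apply hundreds of such rules.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# A hypothetical source record, roughly as it might arrive from an operational system.
raw_record = {
    "customer_name": "  W. J. Franks ",
    "account_balance": "1,204.50",
    "open_date": "2013-07-19",
    "street_address": "123 Oak St.",
    "city": "",                       # blank city -- needs cleansing
}

def standardize(record):
    """Standardize data types and flag fields that need repair before loading."""
    clean, issues = {}, []

    # Date fields must contain only valid dates.
    try:
        clean["open_date"] = datetime.strptime(record["open_date"], "%Y-%m-%d").date()
    except ValueError:
        issues.append("open_date is not a valid date")

    # Account balances must contain only valid numbers.
    try:
        clean["account_balance"] = Decimal(record["account_balance"].replace(",", ""))
    except InvalidOperation:
        issues.append("account_balance is not numeric")

    # Normalize whitespace in text fields.
    clean["customer_name"] = " ".join(record["customer_name"].split())
    clean["street_address"] = record["street_address"].strip()

    # Flag an address with a street but no city for data cleansing.
    if record["street_address"].strip() and not record["city"].strip():
        issues.append("city is blank for a populated street address")
    else:
        clean["city"] = record["city"].strip()

    return clean, issues

print(standardize(raw_record))
```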

Does Hadoop Have What It Takes?

Hadoop was engineered to rely on the schema-on-read approach, in which data is parsed, reformatted, and cleansed at runtime in a manually written program. But Hadoop and Hive have little or no ability to ensure valid dates and numeric account balances. In contrast, relational database management systems (RDBMS) ensure that input records conform to the database design – called the schema. According to Dr. Michael Stonebraker, “This is the best way to keep an application from adding ‘garbage’ to a data set.”
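
As a rough illustration of the difference – a sketch under assumed file formats, not any specific Hadoop or Hive API – the Python snippet below mimics schema-on-read: the raw feed is accepted as-is at write time, and every reading program must parse, cast, and reject garbage itself, work that an RDBMS performs once, at load time.

```python
import csv
from datetime import datetime
from io import StringIO

# Raw feed as it might land in the cluster; the second row is garbage,
# but nothing validates it at write time (schema-on-read).
raw_feed = "1001,2014-03-02,250.00\n1002,not-a-date,oops\n"

def read_with_schema(text):
    """Schema-on-read: the reader applies the schema and rejects bad rows itself."""
    good, rejected = [], []
    for row in csv.reader(StringIO(text)):
        try:
            account_id = int(row[0])
            posting_date = datetime.strptime(row[1], "%Y-%m-%d").date()
            balance = float(row[2])
            good.append((account_id, posting_date, balance))
        except ValueError:
            rejected.append(row)  # an RDBMS schema would have refused this at load time
    return good, rejected

good, rejected = read_with_schema(raw_feed)
print("parsed:", good)
print("rejected at read time:", rejected)
```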

The current rage in the Hadoop community is SQL-on-Hadoop. Those who have committed to open-source Apache are playing catch-up to databases by adding SQL language features. SQL-on-Hadoop offerings support only a subset of the ANSI SQL-92 language, meaning they lack features found in the SQL:1999, 2003, 2006, 2008, and 2011 standards. Therefore, the business user’s ability to perform self-service reporting and analytics is throttled. This, in turn, throws a substantial labor cost back into IT to develop reports in Java.

Additionally, the lack of a database foundation prevents SQL-on-Hadoop from achieving fast performance. Missing from Hadoop are robust indexing strategies, in-database operators, advanced memory management, concurrency, and dynamic workload management.

A consistent – sometimes angry – complaint from Hadoop experts is the poor performance of large table joins, which the SQL-on-Hadoop tools do not fix. Remember those subject areas above? Some subject areas have two to 10 tables in the 50-1,000 terabyte range. Even with a mature analytic database, it is challenging to optimize a query that joins 50TB to 500TB, sorts the result, and does it fast. Fortunately, RDBMS vendors have been innovating on the RDBMS and its cost-based optimizer since the 1980s. A few Apache Hadoop committers are currently reinventing this wheel, intending to release a fledgling optimizer later in 2014. Again, self-service business user query and reporting suffers.

Hadoop, therefore, does not have what it takes to be a data warehouse. It is, however, nipping at the heels of data marts.

How Many Warehouses Has Hadoop Replaced?

As far as we know, Hadoop has never replaced a data warehouse, although we have witnessed a few failed attempts. Instead, Hadoop has been able to peel off a few workloads from an IDW. Migrating low-value data and workloads to Hadoop is not widespread, but neither is it rare.

One workload often offloaded is extract-transform-load (ETL). Technically, Hadoop is not an ETL solution; it is a middleware infrastructure for parallelism. Hadoop requires hand coding of ETL transformations, which is expensive, especially when maintenance costs pile up in the years to come – a small sketch of what such hand coding looks like follows the list below. Simple RDBMS tasks like referential integrity checks and match-key lookups don’t exist in Hadoop or Hive. Hadoop does not provide typical ETL subsystem features out of the box, such as:

*  Hundreds of built-in data-type conversions, transformers, look-up matching, and aggregations

*  Robust metadata, data lineage, and data modeling capabilities

*  Data quality and profiling subsystems

*  Workflow management, i.e., a GUI for generating ETL scripts and handling errors

*  Fine-grained, role-based security
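
To show what such hand coding looks like, here is a rough Python sketch – table and field names are purely illustrative – of just two of the simplest items on that list: a match-key lookup against a customer dimension and a referential-integrity check. An ETL suite or an RDBMS provides both out of the box.

```python
# Hand-coded stand-ins for a match-key lookup and a referential-integrity check.
# In Hadoop these would have to be written, tested, and maintained for every feed.

customer_dim = {            # natural key -> surrogate key, as held in the warehouse
    "CUST-001": 1,
    "CUST-002": 2,
}

incoming_sales = [
    {"customer_key": "CUST-001", "amount": 120.0},
    {"customer_key": "CUST-999", "amount": 75.0},   # no matching dimension row
]

loadable, violations = [], []
for row in incoming_sales:
    surrogate = customer_dim.get(row["customer_key"])
    if surrogate is None:
        violations.append(row)          # referential-integrity failure, handled by hand
    else:
        loadable.append({"customer_sk": surrogate, "amount": row["amount"]})

print("ready to load:", loadable)
print("integrity violations:", violations)
```

Multiply this by dozens of feeds and years of change requests, and the maintenance cost becomes clear.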

Because migrations often come with million-dollar price tags, there is not a stampede of ETL migrations to Hadoop. Many organizations keep the low-value ETL workload in the IDW because:

*  The IDW works (it ain’t broke, don’t fix it)

*  Years of business logic must be recoded, debugged, and vetted in Hadoop (risk)

*  There are higher business value Hadoop projects to be implemented (ROI)

Nevertheless, some ETL workload migrations are justifiable. When they occur, the IDW resources freed up are quickly consumed by business users.

Similarly, Hadoop provides a parallel platform for analytics, but it does not provide the analytics. Hadoop downloads do not include report development tools, dashboards, OLAP cubes, hundreds of statistical functions, time series analysis, predictive analytics, optimization, and other analytics. These must be hand coded or acquired elsewhere and integrated into projects.

Hadoop Was Never Free

Where does this leave the cash-strapped CIO who is still under pressure? According to Phil Russom of The Data Warehousing Institute: “Hadoop is not free, as many people have mistakenly said about it. A number of Hadoop users speaking at recent TDWI conferences have explained that Hadoop incurs substantial payroll costs due to its intensive hand coding normally done by high-payroll personnel.”

This reflects the general agreement in the industry, which is that Hadoop is far from free. The $1,000/terabyte hardware costs are hype to begin with, and traditional vendors are closing in on Hadoop’s hardware price advantage anyway. Additionally, some SQL-on-Hadoop offerings are separately priced as open source vendors seek revenue. If you want Hadoop to be fast and functional, well, that part is moving away from free and toward becoming a proprietary, priced database.

Hadoop Jumps in the Lake

Mark Madsen, President of Third Nature, gives some direction on Hadoop benefits: “Some of the workloads, particularly when large data volumes are involved, require new storage layers in the data architecture and new processing engines. These are the problems Hadoop and alternate processing engines are equipped to solve.”

Hadoop defines a new market, called the data lake. Data lake workloads include the following:

*  Many data centers have 50 million to 150 million files. Organizing these into a cohesive infrastructure – knowing where everything is, its age, its value, and its upstream/downstream uses – is a formidable task. The data lake concept is uniquely suited to solving this.

*  Hadoop can run parallel queries over flat files. This allows it to do basic operational reporting on data in its original form.

*  Hadoop excels as an archival subsystem. Using low-cost disk storage, Hadoop can compress and hold onto data in its raw form for decades. This avoids the problem of crumbling magnetic tapes and current software versions that can’t read the tape they produced eight years earlier. A close cousin to archival is backup-to-disk. Again, magnetic tape is the competitor.

*  Hadoop is ideal for temporary data that will be used for a month or two then discarded. There are many urgent projects that need data for a short time then never again. Using Hadoop avoids the lengthy process of getting data through committees into the data warehouse.

*  Hadoop, most notably YARN from Hortonworks, is providing the first cluster operating system. This is amazing stuff. YARN improves Hadoop cluster management but does not change Hadoop’s position vis-à-vis the data warehouse.

Apples and Oranges

Bob Page, the VP of Development at Hortonworks, weighed in on the Hadoop versus IDW debate: “We don’t see anybody today trying to build an IDW with Hadoop. This is a capability issue, not a cost issue. Hadoop is not an IDW. Hadoop is not a database. Comparing these two for an IDW workload is comparing apples to oranges. I don’t know anybody who would try to build an IDW in Hadoop. There are many elements of the IDW on the technical side that are well refined and have been for 25 years. Things like workload management, the way concurrency works, and the way security works – there are many different aspects of a modern IDW that you are not going to see in Hadoop today. I would not see these two as equivalent.”

Hadoop’s success won’t come as a low-priced imitation of a data warehouse. Instead, we continue to be bullish on Hadoop as we witness the birth of the data lake with predictable birthing pains. Over the next couple of years, the hype will quiet down and we can get to work exploiting the best Hadoop has to offer.

Salesforce CRM Data Migration Success Story, Large US Insurance Company

ExistBI helped a large US insurance company migrate around 40 countries (markets) into its global Salesforce CRM implementation, with 15 datasets per market. The migration had three phases per market. In SIT (system integration testing) we focused on cleaning up the data and making sure it was accurate from a technical perspective and in the correct format. In UAT (user acceptance testing) we focused on data quality and correct relationships; we then loaded to production.

Additionally, we had ongoing interfaces and integrations for customer, product, and sales data with other systems. For migration and integration we used Informatica, with Oracle for the staging area/database; for other data-related tasks we used Talend.

Depending on the data migration services your organization requires, we can offer resources in your office or via onshore or nearshore remote access. For a free assessment or quote, please complete the Contact Us form or call: US/Canada +1 866 965 6332 | UK/Europe +44 (0)207 554 8568.

Integrating Clinical Data With Informatica B2B Data Exchange

ExistBI delivered a project integrating clinical data with Informatica B2B Data Exchange. The project involved helping with business analytics, sharing clinical data, getting business users involved, and simplifying the process of creating new studies. We helped deliver the Informatica B2B DX solution by tracking clinical data loading and partner interaction, restricting access according to management policy, and defining the clinical studies. ExistBI is a leading Informatica consulting, training, and support organization with offices in the US, UK, and Europe.

To learn more about our Informatica consulting services, give us a call or send inquiries to consulting@existbi.com.

Microsoft SharePoint 2013 Project For McDonald’s Restaurants

ClickMcD is a SharePoint 2013 intranet platform that integrates various other platforms and technologies in use across the European business, with a focus on collaboration and communication. The platform is under continual development, with new features, tools, and technologies made available to the European markets that are adopting or already using it. Above all, ClickMcD is easy to use and offers a surprising array of features that enable content owners to collaborate quickly and easily, presenting data and information with just a few clicks.

Our part of the project includes web part customization, site provisioning, SharePoint support, and support for existing customized functionalities. To learn more about our SharePoint consulting and support, send an e-mail to consulting@existbi.com.

EU Bank BI / Data Warehouse Project: BI / DWH System Implementation

(Cognos 8, SQL Server 2005 EE)

ExistBI’s banking client recognized the need, in today’s banking market, to tailor services to clients quickly and accurately, with a more comprehensive offering.

Our client decided to become a more customer-centric organization by having single, proactive points of contact with its customers. To reach this goal, the bank needed to consolidate and extract more meaningful information from its customer data sources. In addition, the bank’s marketing campaigns had to improve in quality and become more focused and relevant to targeted customers. The bank also needed to move toward democratization of information and broader decision-making responsibilities, which requires timely delivery of relevant information to each decision maker.

Therefore, the bank decided a business analytics solution was necessary, starting with a new data warehouse. This solution will provide the client with trend analysis and performance feedback, and will facilitate decisive actions that result in measurable gains in cost efficiency and revenue growth.

Download Case Study: EU Bank BI / Data Warehouse Project

Exist Audit: EDW Design, Architecture and Implementation

Our client is one of the world’s pre-eminent manufacturers of airplane engines and other aerospace-related engineering. They sought to integrate an SAP environment and a number of other systems into an Enterprise Data Warehouse. They had hired a large systems integrator to design and develop the EDW. The tools chosen were an Oracle database, Informatica PowerCenter, and Informatica Data Quality.

We were brought in to audit the EDW design, architecture, and implementation and to provide recommendations.

Download Case Study: Exist Audit: EDW Design, Architecture and Implementation

Business Intelligence Market Study – 2013

Benefits of the Study

This Business Intelligence Market Study provides a wealth of information and analysis – offering value to both consumers and producers of Business Intelligence technology and services.

Download: Business Intelligence Market Study

Enterprise Information Management Strengthens Your Information Value

Why Read This Report?

Businesses increasingly rely on information to make smarter, faster decisions for competitive advantage. Although business leaders want access to all kinds of information, structured data and unstructured content are often stored separately and have disconnected architectures. Enterprise information management (EIM) encompasses the processes, policies, technologies, and architectures that capture, consume, and govern the usage of an organization’s structured data and unstructured content. EIM enables businesses to derive more value from their data and content, harmonizing what has traditionally been a dichotomy. Forrester proposes a logical representation for EIM that unifies data management and content management using a common set of foundational technologies.

Download Report: Enterprise Information Management Strengthens Your Information Value

Data Quality for BASEL II

Trillium Software’s BASEL II Solution helps large, complex financial services institutions meet the FFIEC regulators’ standards for data quality attestation in risk reporting.

The BASEL II Solution delivers a data quality process management platform and domain expertise to support the rigorous demands of commercial and retail risk reporting and attestation. The solution leverages pre-built risk data quality best practice templates that provide new levels of transparency to reporting and attestation processes…

Download Paper: Data Quality for BASEL II

Why MDM For Your Business?

This White Paper demonstrates how MDM can improve operations with business-user access to consolidated and reliable business-critical data, timely access to data about customers, products, channel partners, suppliers, employees, and more…

White Paper: Gartner Magic Quadrant for Data Quality Tools

Demand for data quality tools remains strong, with more focus on deployments that support information governance programs, master data management initiatives and application modernization efforts…

White Paper: Gartner Magic Quadrant for Data Integration Tools

IT leaders focused on data integration competencies must be aware of key trends and developments in the data integration tools market. During 2012, organizations continued to emphasize cost-effectiveness…

IBM Cognos Consulting and Implementation

Our client is a large government body rolling out a common accounting platform across the entire region for which this government department was responsible.

How ExistBI Improved a Large Company’s BI Environment & Helped Implement SAP Business Objects

Our client is a large Internet company with a variety of distributed ERP and CRM systems as well as different business information sectors such as accounting, billing, sales, profits, credits, products, proposals, and online operational data…

How ExistBI Helped A Major Grocery Retailer With Their BI Strategy

As part of a major systems implementation, Exist led the definition of a business intelligence strategy for a major grocery retailer in the UK. The retailer, which has over 450 stores and turnover in excess of £10 billion…

What Executives Do Not Understand About Big Data

How much more profitable would your business be if you had, for free, access to 100 times more data about your customers?
