Data Warehouse
Data warehouse is a repository of an organization’s electronically
stored data. Data warehouses are designed to facilitate reporting and analysis.
Data
mining potential can be enhanced if the appropriate data has been collected and
stored in a data warehouse. A data warehouse is a relational database
management system (RDBMS) designed specifically to meet the needs of
transaction processing systems. It can be loosely defined as any centralized
data repository which can be queried for business benefit. Data warehousing is
a new powerful technique making it possible to extract archived operational
data and overcome inconsistencies between different legacy data formats.
A data
warehouse houses a standardized, consistent, clean and integrated form of data
sourced from various operational systems in use in the organization, structured
in a way to specifically address the reporting and analytic requirements.
This
definition of data warehouse focuses on data storage. However, the means to
retrieve and analyze data, to extract, transform and load data, and to manage
the data dictionary are also considered essential components of a data
warehousing system. Many references to data warehousing use this broader
context. Thus, an expanded definition for data warehousing includes business
intelligence tools, tools to extract, transform, and load data into the
repository, and tools to manage and retrieve metadata.
Characteristics of a data warehouse
§ subject-oriented: data
are organized according to subject instead of application e.g. an insurance
company using a data warehouse would organize their data by customer, premium,
and claim, instead of by different products (auto, life etc). The data organized
by subject contain only the information necessary for decision support
processing.
§ integrated: when data
resides in many separate applications in the operational environment, encoding
of data is often inconsistent. For instance, in one application gender might be
coded as “m” and “f” in another by 0 and 1. When data are moved form the
operational environment into the data warehouse, they assume a consistent
coding convention eg. gender data is transformed to “m” and “f”.
§ time-variant: The data
warehouse contains a place for storing data that are five to 10 years old, or
older, to be used for comparisons, trends, and forecasting. These data are not
updated.
§ non-volatile: Data are
not updated or changed in any way once they enter the data warehouse, but are
only loaded and accessed.
Criteria for a data warehouse
§ Load Performance:
performance of the load process should be measured in hundreds of millions of
rows and gigabytes per hour and must not artificially constrain the volume of
data required by the business.
§ Load Processing: Many
steps must be taken to load new or updated data into the data warehouse
including data conversions, filtering, reformatting, integrity checks, physical
storage, indexing, and ,metadata update. There steps must be executed as a single
seamless unit of work.
§ Data Quality Management:
The shift to fact-based management demands the highest data quality. The
warehouse must ensure local consistency, global consistency, and referential
integrity despite “dirty” sources and massive database size.
§ Query Performance:
Fact-based management and ad-hoc analysis must not be slower or inhibited by
the performance of the data warehouse RDBMS: large, complex queries for key
business operations must complete in seconds not days.
§ Terabyte Scalability –
Data warehouse sizes are growing at astonishing rates. The RDBMS must not have
any architectural limitations. It must support modular and parallel management.
It must support continued availability in the event of a point failure and must
provide a fundamentally different mechanism for recovery.
§ Mass User Scalability:
the RDBMS server must support hundreds, even thousands of concurrent users
while maintaining acceptable query performance.
§ Networked Data Warehouse:
The server must include tools that coordinate the movement of subsets of data
between warehouses.Users must be able to look at and work with multiple
warehouses from a single client workstation. Warehouse managers have to manage
and administer a network of warehouses from a single physical location.
§ Warehouse Administration:
The RDBMS must provide controls for implementing resource limits, chargeback
accounting to allocate costs back to users and query prioritization to address
the needs of different user classes and activities.
§ Integrated Dimensional
Analysis: The power of multidimensional views is widely accepted, and
dimensional support must be inherent in the warehouse RDBMS to provide the
highest performance for relational OLAP tools. The RDBMS must support fast,easy
creation of precomputed summaries common in large data warehouses.
§ Advanced Query
functionality: The RDBMS must provide a complete set of analytical operations
including core sequential and statistical operations.
Conceptualization
of a data warehouse architecture consists of the following interconnected
layers:
Operational
database layer:http://pratikshya.com.np/wp-admin/post-new.php
the
source data layer for the data warehouse- an organization’s enterprise resource
planning systems fall into this layer
Data
Access layer:
The
interface between the operational and informational access layer- Tools to
extract, transform, load data into the warehouse fall into this layer
Metadata
layer:
THe
data dictionary- This is usually more detailed than an operational sysem data
directory. There are dictionaries for the entire warehouse and sometimes
dictionaries for the data that can be accessed by a particular reporting and
analysis tool.
Informational
access layer:
The
data accessed for reporting and analysing and the tools for reporting and
analyzing data- business intelligence tools fall into this layer.
Some of the Benefits that a data warehouse provides are as follows:
A data
warehouse provides a common data model for all data of interest regardless of
the data’s source. This makes it easier to report and analyse information than
it would be if multiple data models were used to retrieve information such as
sales invoices, order receipts, general ledger charges, etc.
§ Prior to loading data
into the data warehouse, inconsistencies are identified and resolved. This
greatly simplifies reporting and analysis
§ information in the data
warehouse is under the control of data warehouse users so that, even if the
source system data is purged over time, the information in the warehouse can be
stored safely for extended periods of time
§ Because they are separate
form operational systems, data warehouses provides retrieval of data without
slowing down operational system
§ Data warehouses can work
in conjunction with and hence, enhance the value of operational business
applications, notably customer relationship management (CRM) systems
§ Data warehouses facilitates
decision support system applications such as trend reports (eg. the items with
the most sales in a particular area within the last two years), exception
reports, and reports that show actual performance versus goals.
Disadvantages
§ Data warehouses are not
the optimal environment for unstructured data
§ Because data must be
extracted, transformed and loaded into the warehouse, there is an element of
latency in data warehouse data
§ Over their life, data
warehouses can have high cost
§ Data warehouses can get
outdated relatively quickly. There is a cost of delivering suboptimal
information to organization
§ There is often a fine
line between data warehouses and operational systems. Duplicate, expensive
functionality may be developed. Or, functionality may be developed in the data
warehouse that, in retrospect, should have been developed in the operational
systems.
Sample Applications
§ Decision Support
§ Trend analysis
§ Financial forecasting
§ Churn prediction for
telecom subscribers, credit card users etc
§ insurance fraud analysis
§ call record analysis
§ Logistics and inventory
management
§ Agriculture
No comments:
Post a Comment