Data Mining
Data mining refers loosely to the process of semi-automatically analyzing large databases to find useful patterns. It attempts to discover rules and patterns from data.
§ deals with ‘knowledge discovery in databases’
§ There are a number of applications of data mining, such as prediction of values based on past examples, finding associations between purchases, and automatic clustering of people and movies.
Different views on data mining:
Data mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies. – William Frawley, Gregory Piatetsky-Shapiro and Christopher J. Matheus
Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as a relationship between patient data and their medical diagnoses. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database. – Marcel Holshemier and Arno Siebes (1994)
The analogy with the mining process is described as:
Data mining refers to “using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in areas such as decision support, prediction, forecasting, and estimation. The data is often voluminous but, as it stands, of low value, as no direct use can be made of it; it is the hidden information in the data that is useful.”
Stages of Data Mining
Selection: selecting or segmenting the data according to some criteria, e.g. all those people who own a car; in this way subsets of the data can be determined.
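As a minimal illustration, selection can be expressed as a simple filter over a table; the customer data and the owns_car column below are hypothetical:

```python
import pandas as pd

# Hypothetical customer table; the column names are illustrative only.
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Carla", "Dev"],
    "owns_car": [True, False, True, False],
    "age": [34, 29, 51, 42],
})

# Selection: segment the data according to some criterion,
# here all those people who own a car.
car_owners = customers[customers["owns_car"]]
print(car_owners)
```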
Preprocessing: this is the data cleansing stage, where certain information is removed because it is deemed unnecessary and may slow down queries; for example, it is unnecessary to note the sex of a patient when studying pregnancy. The data is also reconfigured to ensure a consistent format, as there is a possibility of inconsistent formats when the data is drawn from several sources, e.g. sex may be recorded as ‘f’ or ‘m’ in one source and as 1 or 0 in another.
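A minimal Python sketch of this kind of normalization, assuming the f/m and 1/0 encodings above (the mapping of 1 to female and 0 to male is an assumption and would have to be checked against each source’s codebook):

```python
import pandas as pd

# Records drawn from several sources may encode the same attribute
# inconsistently, e.g. sex recorded as 'f'/'m' in one source and 1/0 in another.
records = pd.DataFrame({"patient_id": [101, 102, 103, 104],
                        "sex": ["f", "m", 1, 0]})

# Map every known encoding onto one consistent format.
# NOTE: 1 -> 'F' and 0 -> 'M' is an assumption made for illustration.
sex_map = {"f": "F", "m": "M", 1: "F", 0: "M"}
records["sex"] = records["sex"].map(sex_map)
print(records)
```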
Transformation: the data is not merely transferred across but transformed, in that overlays may be added, such as the demographic overlays commonly used in market research. The data is made usable and navigable.
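As a hedged sketch of adding such an overlay, assuming transaction records and a hypothetical demographic table keyed on postcode:

```python
import pandas as pd

# Hypothetical transactions and a demographic overlay keyed on postcode.
transactions = pd.DataFrame({"customer_id": [1, 2, 3],
                             "postcode": ["AB1", "CD2", "AB1"],
                             "spend": [120.0, 85.0, 40.0]})
demographics = pd.DataFrame({"postcode": ["AB1", "CD2"],
                             "median_income": [41000, 52000]})

# The overlay arrives as extra columns, making the data richer and navigable.
enriched = transactions.merge(demographics, on="postcode", how="left")
print(enriched)
```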
Data Mining: this stage is concerned with the extraction of patterns from the data. A pattern can be defined as follows: given a set of facts (data) F, a language L, and some measure of certainty C, a pattern is a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler in some sense than the enumeration of all the facts in Fs.
Interpretation and evaluation: the patterns identified by the system are interpreted into knowledge which can then be used to support human decision-making, e.g. prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena.
There are a number of data mining methods available to businesses today. Among the most common are data reduction methods, classification systems, and other predictive modelling tools.
Methods
§ Association rule learning
§ Cluster analysis
§ Structured data analysis
§ Data analysis
§ Predictive analysis
§ Knowledge discovery
Data reduction tools provide a systematic way to cut down the number of variables that you consider in your decision making. With classification methods, you can build rules that allow you to classify your new customers. After using these data mining methods, you can employ statistical modelling to predict future outcomes such as cross-buying, customer defection, and sales.
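As one hedged example of a data reduction method, principal component analysis (PCA) compresses a set of correlated variables into a few components; the data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 customers described by 10 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Data reduction: keep only 3 components instead of 10 variables.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)   # variance retained per component
```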
Predictive Analysis
§ is an area of statistical analysis that deals with extracting information from data and using it to predict future trends and behavior patterns. The core of predictive analytics relies on capturing relationships between explanatory variables and the predicted variables from past occurrences, and exploiting them to predict future outcomes (a minimal sketch follows this list).
§ used in actuarial science, financial services, insurance, telecommunications, retail, travel, and other fields.
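The sketch below illustrates the idea with a logistic regression fitted on synthetic ‘past occurrences’; the variables and data are hypothetical, not drawn from any real application:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic past occurrences: two explanatory variables, one binary outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Capture the relationship between explanatory and predicted variables...
model = LogisticRegression().fit(X_train, y_train)

# ...and exploit it to predict future outcomes.
print(model.predict(X_test[:5]))
print("accuracy:", model.score(X_test, y_test))
```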
Data Analysis
§ analysis of data is a
process of inspecting, cleaning, transforming and modeling data with the goal
of highlighting useful information, suggesting conclusions and supporting
decision making.
Cluster Analysis or clustering
§ is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense (see the sketch after this list)
§ clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields, including machine learning, data mining, and pattern recognition.
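A minimal clustering sketch using k-means; the points and the choice of two clusters are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic observations: two well-separated groups of points.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])

# Unsupervised assignment of observations into clusters so that
# observations in the same cluster are similar in some sense.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])      # cluster assignment per observation
print(kmeans.cluster_centers_)  # one centre per cluster
```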
Association rule learning
§ In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases.
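The two core quantities behind association rules are support and confidence; a minimal pure-Python sketch over hypothetical market baskets:

```python
from itertools import combinations

# Hypothetical market-basket transactions.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk", "butter"}]

def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Confidence of a rule X -> Y is support(X and Y) / support(X).
for x, y in combinations(["bread", "milk", "butter"], 2):
    conf = support({x, y}) / support({x})
    print(f"{x} -> {y}: support={support({x, y}):.2f}, confidence={conf:.2f}")
```

In a real system an algorithm such as Apriori prunes the search so that only itemsets above a minimum support are ever considered.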
Data Mining Problems/Issues
Data mining systems rely on databases to supply the raw data for input, and this raises problems in that databases tend to be dynamic, incomplete, noisy, and large. Other problems arise as a result of the adequacy and relevance of the information stored.
§ Limited information: A database is often designed for purposes different from data mining, and sometimes the properties or attributes that would simplify the learning task are not present, nor can they be requested from the real world. Inconclusive data causes problems because, if some attributes essential to knowledge about the application domain are not present in the data, it may be impossible to discover significant knowledge about the domain. For example, one cannot diagnose malaria from a patient database if that database does not contain the patients’ red blood cell counts.
§ Noise and missing values: Databases are usually contaminated by errors, so it cannot be assumed that the data they contain is entirely correct. Attributes which rely on subjective or measurement judgements can give rise to errors, such that some examples may even be mis-classified. Errors in either the values of attributes or the class information are known as noise. Obviously, where possible, it is desirable to eliminate noise from the classification information, as this affects the overall accuracy of the generated rules.
Missing data can be treated by discovery systems in a number of ways (a minimal sketch follows the list), such as:
- simply disregard missing values
- omit the corresponding records
- infer missing values from known values
- treat missing data as a special value to be included additionally in the attribute domain
- or average over the missing values using Bayesian techniques.
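A minimal pandas sketch of three of these treatments on a hypothetical patient table:

```python
import numpy as np
import pandas as pd

# Hypothetical patient table with missing values.
df = pd.DataFrame({"age": [34, np.nan, 51, 42],
                   "sex": ["F", "M", None, "F"]})

# Omit the corresponding records entirely.
dropped = df.dropna()

# Infer missing values from known values, here the column mean.
imputed = df.assign(age=df["age"].fillna(df["age"].mean()))

# Treat missing data as a special value added to the attribute domain.
flagged = df.fillna({"sex": "UNKNOWN"})

print(dropped, imputed, flagged, sep="\n\n")
```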
§ Uncertainty: Uncertainty
refers to the severity of the error and the degree of noise in the data. Data
precision is an important consideration in a discovery system.
§ Size, updates, and irrelevant fields: Databases tend to be large and dynamic, in that their contents are ever-changing as information is added, modified, or removed. The problem with this, from a data mining perspective, is how to ensure that the rules are up-to-date and consistent with the most current information. The learning system also has to be time-sensitive, as some data values vary over time and the discovery system is affected by the ‘timeliness’ of the data.