Oracle Data Mining
Oracle Data Mining is an option of Oracle Corporation's Relational Database Management System (RDBMS) Enterprise Edition (EE). It contains several data mining and data analysis algorithms for classification, prediction, regression, associations, feature selection, anomaly detection, feature extraction, and specialized analytics. It provides means for the creation, management and operational deployment of data mining models inside the database environment.
Overview
Oracle Data Mining implements a variety of data mining algorithms inside the Oracle relational database. These implementations are integrated right into the Oracle database kernel and operate natively on data stored in the relational database tables. This eliminates the need for extraction or transfer of data into standalone mining/analytic servers. The relational database platform is leveraged to securely manage models and efficiently execute SQL queries on large volumes of data. The system is organized around a few generic operations providing a general unified interface for data mining functions. These operations include functions to create, apply, test, and manipulate data mining models. Models are created and stored as database objects, and their management is done within the database, similar to tables, views, indexes and other database objects.
In data mining, the process of using a model to derive predictions or descriptions of behavior that has yet to occur is called "scoring". In traditional analytic workbenches, a model built in the analytic engine has to be deployed in a mission-critical system to score new data, or the data must be moved from relational tables into the analytical workbench; most workbenches offer proprietary scoring interfaces. ODM simplifies model deployment by offering Oracle SQL functions to score data stored directly in the database. This way, the user or application developer can leverage the full power of Oracle SQL, both to pipeline and manipulate the results over several query levels and to parallelize and partition data access for performance.
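For illustration, such a scoring query can be embedded in ordinary SQL and combined freely with filtering, joins, and ordering. The sketch below reuses the hypothetical model and table names from the PL/SQL example in this article:

SELECT customer_id,
       PREDICTION (credit_risk_model USING *) AS predicted_risk,
       PREDICTION_PROBABILITY (credit_risk_model USING *) AS risk_probability
  FROM credit_card_data
 ORDER BY risk_probability DESC;

PREDICTION returns the most likely target class and PREDICTION_PROBABILITY the model's confidence in it; because both are ordinary single-row SQL functions, the table scan can be parallelized and the results can feed further query levels.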
Models can be created and managed by one of several means. Oracle Data Miner is a graphical user interface that steps the user through the process of creating, testing, and applying models (e.g. along the lines of the CRISP-DM methodology). Application and tools developers can embed predictive and descriptive mining capabilities using PL/SQL or Java APIs. Business analysts can quickly experiment with, or demonstrate the power of, predictive analytics using the Oracle Spreadsheet Add-In for Predictive Analytics, a dedicated Microsoft Excel adaptor interface. ODM offers a choice of well-known machine learning approaches such as Decision Trees, Naive Bayes, Support Vector Machines, and Generalized Linear Models (GLM) for predictive mining; Association Rules, k-Means, and Orthogonal Partitioning Clustering (O-Cluster); and Non-negative Matrix Factorization for descriptive mining. A Minimum Description Length based technique to grade the relative importance of input mining attributes for a given problem is also provided. Most Oracle Data Mining functions also allow text mining by accepting text (unstructured data) attributes as input.
History
Oracle Data Mining was first introduced in 2002 and its releases are named according to the corresponding Oracle database release:
- Oracle Data Mining 9iR2 (9.2.0.1.0 - May 2002)
- Oracle Data Mining 10gR1 (10.1.0.2.0 - February 2004)
- Oracle Data Mining 10gR2 (10.2.0.1.0 - July 2005)
- Oracle Data Mining 11gR1 (11.1 - September 2007)
- Oracle Data Mining 11gR2 (11.2 - September 2009)
Oracle Data Mining is a logical successor of the Darwin data mining toolset developed by Thinking Machines Corporation in the mid-1990s and later distributed by Oracle after its acquisition of Thinking Machines in 1999. However, the product itself is a complete redesign and rewrite from the ground up: while Darwin was a classic GUI-based analytical workbench, ODM offers a data mining development/deployment platform integrated into the Oracle database, along with a GUI.
Road map: the Oracle Data Miner 11gR2 New Workflow GUI was previewed at Oracle Open World 2009. See the ODM Blog entry "Get Ready for Oracle Data Miner 11gR2 New Workflow GUI" for more information: http://blogs.oracle.com/datamining/2010/02/get_ready_for_the_new_oracle_data_miner_11gr2_gui_1.html
Functionality
As of release 11gR1, Oracle Data Mining contains the following data mining functions:
- Data transformation and model analysis:
  - Data sampling, binning, discretization, and other data transformations.
  - Model exploration, evaluation and analysis.
- Feature selection (Attribute Importance):
  - Minimum description length (MDL).
- Classification:
  - Naive Bayes (NB).
  - Generalized linear model (GLM) for logistic regression.
  - Support Vector Machine (SVM).
  - Decision Trees (DT).
- Anomaly detection:
  - One-class Support Vector Machine (SVM).
- Regression:
  - Support Vector Machine (SVM).
  - Generalized linear model (GLM) for multiple regression.
- Clustering:
  - Enhanced k-means (EKM).
  - Orthogonal Partitioning Clustering (O-Cluster).
- Association rule learning:
  - Itemsets and association rules (AM).
- Feature extraction:
  - Non-negative matrix factorization (NMF).
- Text and spatial mining:
  - Combined text and non-text columns of input data.
  - Spatial/GIS data.
Input sources and data preparation
Most Oracle Data Mining functions accept as input one relational table or view. Flat data can be combined with transactional data through the use of nested columns, enabling mining of data involving one-to-many relationships (e.g. a star schema). The full functionality of SQL can be used when preparing data for data mining, including dates and spatial data.
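As a sketch of the nested-column approach (the source tables here are hypothetical), one-to-many transactional data can be folded into a single nested attribute per case using the DM_NESTED_NUMERICAL type:

CREATE VIEW customer_purchases_v AS
SELECT c.customer_id,
       CAST (COLLECT (DM_NESTED_NUMERICAL (p.product_name, p.amount))
             AS DM_NESTED_NUMERICALS) AS purchases
  FROM customers c
  JOIN purchases p ON p.customer_id = c.customer_id
 GROUP BY c.customer_id;

Each row of the view then represents one mining case whose 'purchases' attribute holds that customer's full transaction history.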
Oracle Data Mining distinguishes numerical, categorical, and unstructured (text) attributes. The product also provides utilities for data preparation steps prior to model building, such as outlier treatment, discretization, normalization, and binning.
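As an example (a sketch only; the table and view names are hypothetical), the DBMS_DATA_MINING_TRANSFORM package can define equal-width bins from the training data and expose the binned data as a view suitable for model building:

BEGIN
  -- create a definition table to hold numerical bin boundaries
  DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM (
    bin_table_name  => 'credit_bin_defs');
  -- compute equal-width bin boundaries from the training data
  DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_EQWIDTH (
    bin_table_name  => 'credit_bin_defs',
    data_table_name => 'credit_card_data',
    bin_num         => 5);
  -- expose the binned data as a view that can be fed to CREATE_MODEL
  DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM (
    bin_table_name  => 'credit_bin_defs',
    data_table_name => 'credit_card_data',
    xform_view_name => 'credit_card_data_binned');
END;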
Graphical user interface: Oracle Data Miner
Oracle Data Mining can be accessed using Oracle Data Miner, a GUI "client" that provides access to the data mining functions and structured templates called Mining Activities, which automatically prescribe the order of operations, perform required data transformations, and set model parameters. The user interface also allows the automated generation of Java and/or SQL code associated with the data mining activities. The Java Code Generator is an extension to Oracle JDeveloper. There is also an independent interface: the Spreadsheet Add-In for Predictive Analytics, which enables access to the Oracle Data Mining Predictive Analytics PL/SQL package from Microsoft Excel.
PL/SQL and Java interfaces
Oracle Data Mining provides a native PL/SQL package (DBMS_DATA_MINING) to create, destroy, describe, apply, test, export and import models. The code below illustrates a typical call to build a classification model:
BEGIN
DBMS_DATA_MINING.CREATE_MODEL (
model_name => 'credit_risk_model',
mining_function => DBMS_DATA_MINING.classification,
data_table_name => 'credit_card_data',
case_id_column_name => 'customer_id',
target_column_name => 'credit_risk',
settings_table_name => 'credit_risk_model_settings');
END;
Here 'credit_risk_model' is the name of the model, built to classify future customers' 'credit_risk' based on training data provided in the table 'credit_card_data'; each training case is identified by a unique 'customer_id', and the remaining model parameters are specified through the settings table 'credit_risk_model_settings'.
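The settings table is an ordinary two-column table of (setting name, setting value) pairs. As a sketch, it could be created and populated to select the Decision Tree algorithm as follows (the table name matches the hypothetical call above):

CREATE TABLE credit_risk_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- choose the algorithm via the package-supplied constants
  INSERT INTO credit_risk_model_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.algo_name, DBMS_DATA_MINING.algo_decision_tree);
  COMMIT;
END;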
Oracle Data Mining also supports a Java API consistent with the Java Data Mining (JDM) standard for data mining (JSR-73), enabling integration with web and Java EE applications and facilitating portability across platforms.
SQL scoring functions
As of release 10gR2, Oracle Data Mining contains built-in SQL functions for scoring data mining models. These single-row functions support classification, regression, anomaly detection, clustering, and feature extraction. The code below illustrates a typical usage of a classification model:
SELECT customer_name
FROM credit_card_data
WHERE PREDICTION (credit_risk_model USING *) = 'LOW' AND customer_value = 'HIGH';
PMML
In Release 11gR2 (11.2.0.2), ODM supports the import of externally created PMML representations for some of the data mining models. PMML is an XML-based standard for representing data mining models.
Predictive Analytics MS Excel Add-In
The PL/SQL package DBMS_PREDICTIVE_ANALYTICS automates the data mining process, including data preprocessing, model building and evaluation, and scoring of new data. The PREDICT operation is used for predicting target values (classification or regression), while EXPLAIN ranks attributes in order of their influence in explaining a target column (feature selection). The 11g feature PROFILE finds customer segments and their profiles, given a target attribute. These operations can be used as part of an operational pipeline providing actionable results, or displayed for interpretation by end users.
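A sketch of the one-call PREDICT and EXPLAIN operations follows; the table names are hypothetical, and the named result tables are created by the package:

DECLARE
  v_accuracy NUMBER;
BEGIN
  -- one-call prediction; per-case results go to the result table
  DBMS_PREDICTIVE_ANALYTICS.PREDICT (
    accuracy            => v_accuracy,
    data_table_name     => 'credit_card_data',
    case_id_column_name => 'customer_id',
    target_column_name  => 'credit_risk',
    result_table_name   => 'credit_risk_predictions');
  -- rank attributes by their influence on the target column
  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
    data_table_name     => 'credit_card_data',
    explain_column_name => 'credit_risk',
    result_table_name   => 'credit_risk_explain');
END;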
See also
- Oracle LogMiner - in contrast to generic data mining, targets the extraction of information from the internal logs of an Oracle database