Unit-3
Classification and Prediction
What Is Classification? What Is
Prediction?
Classification
and prediction are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Such
analysis can help provide us with a better understanding of the data at large.
Whereas classification predicts categorical (discrete, unordered)
labels, prediction models continuous valued functions.
Example:-
·
A
bank loans officer needs analysis of her data in order to learn which loan
applicants are “safe” and which are “risky” for the bank. Here the data
analysis task is classification, where a model or classifier is constructed to
predict categorical labels, such as “safe” or “risky” for the
loan application data.
·
Suppose
that the marketing manager would like to predict how much a given customer will
spend during a sale at AllElectronics. This data analysis task is an
example of numeric prediction, where the model constructed predicts a continuous-valued
function,or ordered value,
What
are steps in Data Classification?
Ø Each
tuple/sample is assumed to belong to a predefined class, as determined by the
class label attribute
Ø The set of
tuples used for model construction is training set
Ø The model is
represented as classification rules, decision trees, or mathematical formulae
Data
classification is a two-step process
(a) Learning: Training data are
analyzed by a classification algorithm. Here, the class label attribute and the
learned model or classifier is represented in the form of classification rules.
(b) Classification:
Test data are used to estimate the accuracy of the classification rules. If the
accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
To Estimate accuracy of the model
we do the following-
Ø The known label
of test sample is compared with the classified result from the model
Ø Accuracy rate is
the percentage of test set samples that are correctly classified by the model
Ø Test set is
independent of training set, otherwise over-fitting will occur
Ø If the accuracy
is acceptable, use the model to classify data tuples whose class labels are not
known
Process
(1): Model Construction
Supervised vs. Unsupervised Learning
- Supervised
learning (classification)
1)
Supervision: The training data
(observations, measurements, etc.) are accompanied by labels indicating the
class of the observations
2)
New data is classified based on the
training set
- Unsupervised
learning (clustering)
1)
The class labels of training data is
unknown
2)
Given a set of measurements,
observations, etc. with the aim of establishing the existence of classes or
clusters in the data
Issues Regarding
Classification and Prediction
Ø
Relevance
analysis: Many of the attributes in the data may be redundant.
Correlation analysis can be used to identify whether any two given attributes
are statistically related.
Ø
Data
transformation and reduction: The data may be transformed by normalization,
particularly when neural networks or methods involving distance measurements
are used in the learning step. Normalization involves scaling all values for a
given attribute so that they fall within a small specified range, such as 1:0
to 1:0, or 0:0 to 1:0. In methods that use distance measurements, for example,
this would prevent attributes with initially large ranges (like, say, income)
from outweighing attributes with initially smaller ranges (such as binary
attributes).
Comparing
Classification and Prediction Methods
Ø Accuracy: The
accuracy of a classifier refers to the ability of a given classifier to
correctly predict the class label of new or previously unseen data.
Ø Speed: This
refers to the computational costs involved in generating and using the given
classifier or predictor.
Ø Robustness: This
is the ability of the classifier or predictor to make correct predictions given
noisy data or data with missing values.
Ø Scalability:
This refers to the ability to construct the classifier or predictor efficiently
given large amounts of data.
Ø Interpretability:
This refers to the level of understanding and insight that is provided by the
classifier or predictor. Interpretability is subjective and therefore more
difficult to assess.
What
are the types of Classification techniques ?
1) Classification
by Decision Tree Induction
Decision
tree induction is the learning of decision trees from class-labeled
training tuples.A decision tree is
a flowchart-like tree structure, where each internal node
(nonleaf node) denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf
node (or terminal node)
holds a class label. The topmost node in a tree is the root
node.
Decision
Tree Induction: Training Dataset
Output: A Decision Tree for “buys_computer”
How to select
the root?
Attribute
Selection Measure:
Ø Information Gain (ID3/C4.5)
- Select the attribute with the highest
information gain
- Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci, D|/|D|
- Expected information (entropy) needed to classify a tuple in D:
From
the above example:-total D=14 TUPLES
(9-buys computer-yes) ( 5-buys computer-no)
THE
ATTRIBUTE WITH MAXIMUN GAIN IS MADE THE ROOT WHICH IS AGE IN THIS CASE.
Gain
Ratio for Attribute Selection (C4.5)
n Information gain measure is biased towards attributes with a
large number of values
n
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
n GainRatio(A) = Gain(A)/SplitInfo(A)
n The attribute with the maximum gain ratio is selected as the
splitting attribute
2) Rule-Based Classification
Using
IF-THEN Rules for Classification
Rules are a good
way of representing information or bits of knowledge. A rule-based classifier
uses a set of IF-THEN rules for classification. An IF-THEN rule is an
expression of the form
IF condition THEN conclusion.
|
An example is
rule R1,
R1: IF age =
youth AND student = yes THEN buys computer = yes.
The “IF”-part
(or left-hand side)of a rule is known as the rule antecedent or precondition. The
“THEN”-part (or right-hand side) is the rule consequent. In the rule
antecedent, the condition consists of one or more attribute tests (such
as age = youth, and student = yes) that are
logically ANDed. The rule’s consequent contains a class prediction (in this
case,we are predicting whether a customer will buy a computer). R1 can
also be written as-
R1:
(age = youth) ^ (student = yes))->(buys
computer = yes).
A rule R can
be assessed by its coverage and accuracy. Given a tuple, X, from
a class labeled data set,D, let ncovers be the number of tuples
covered by R; ncorrect be the number of tuples correctly
classified by R; and jDj be the number of tuples in D.
We can define
the coverage and accuracy of R as
coverage(R) =ncovers/D
Accuracy(R) =ncorrect/D
Rule Extraction from
a Decision Tree
To extract rules from a decision
tree, one rule is created for each path from the root to a leaf node. Each
splitting criterion along a given path is logically ANDed to form the rule
antecedent (“IF” part). The leaf node holds the class prediction, forming the
rule consequent (“THEN” part).
Algorithm: Sequential covering. Learn a set of IF-THEN
rules for classification.
Input:
D, a data set class-labeled
tuples;
Att vals, the set of all attributes and
their possible values.
Output: A set of IF-THEN rules.
Method:
(1) Rule set = fg; // initial set of rules
learned is empty
(2) for each class c do
(3) repeat
(4) Rule = Learn One Rule(D, Att vals,
c);
(5) remove tuples covered by Rule from D;
(6) until terminating condition;
(7) Rule set = Rule set +Rule;
// add new rule to rule set
(8) endfor
(9)
return Rule Set;
Examples
of set D are
3) Classification by Back propagation
- Back
propagation is a neural network learning algorithm. Roughly speaking, a
neural network is a set of connected input/output units in which each
connection has a weight associated with it. During the learning phase, the
network learns by adjusting the weights so as to be able to predict the
correct class label of the input tuples. Neural network learning is also
referred to as connectionist learning due to the connections between units
- Advantages
of neural networks, however, include their high tolerance of noisy data as
well as their ability to classify patterns on which they have not been
trained. They can be used when you may have little knowledge of the
relationships between attributes and classes.
- The
back propagation algorithm performs learning on a multilayer
feed-forward neural network. It iteratively learns a set of weights
for prediction of the class label of tuples. A multilayer feed-forward
neural network consists of an input layer, one or more hidden
layers, and an output layer.
- Each
layer is made up of units. The inputs to the network correspond to the
attributes measured for each training tuple. The inputs are fed
simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a
second layer of “neuronlike” units, known as a hidden layer. The outputs
of the hidden layer units can be input to another hidden layer, and so on.
- The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network’s prediction for given tuples.The units in the input layer are called input units. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis,or as output units.
STEPS:-
- Defining Network Topology
Before training can begin, the
user must decide on the network topology by specifying the number of units in
the input layer, the number of hidden layers(if more than one), the number of
units in each hidden layer, and the number of units in the output layer.
- Backpropagation
Back propagation learns by
iteratively processing a set of training samples, comparing the network’s predicition
for each sample with the actual known class label. For each training sample,
the weights are modified so as to minimize the mean squared error between the
network’s prediction and the actual class. These
modifications are made in the “backwards” direction, that is , form the output
layer through each hidden layer down to the first hidden layer (hence the name
backpropagation).
- Initialize
the weights. The weights in the network are initialized to small random
number(e.g., ranging from -1.0 to1.0,or -0.5 to 0.5).
- Propagate the inputs forward
In
this step,the net input and output of each unit in the hidden and output layers
are computed.
Q) WHAT ARE LAZY AND
EAGER LEARNERS?
Lazy vs. Eager Learning
n Lazy
learning (e.g., instance-based learning): Simply stores training data (or only
minor processing) and waits until it is given a test tuple
n Eager
learning (the above discussed methods): Given a set of training set, constructs
a classification model before receiving new (e.g., test) data to classify
n Lazy: more time in predicting
n Accuracy:
More in lazy learners.
1)The
k-Nearest Neighbor Algorithm
Nearest-neighbor
classifiers are based on learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it. The training tuples are
described by n attributes. Each tuple represents a point in an n-dimensional
space. In this way, all of the training tuples are stored in an n-dimensional
pattern space. When given an unknown tuple, a k-nearest-neighbor
classifier searches the pattern space for the k training tuples that are
closest to the unknown tuple. These k training tuples are the k “nearest
neighbors” of the unknown tuple.
“Closeness” is
defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between
two points or tuples, say, X1 = (x11, x12, : : : , x1n)
and
X2
= (x21, x22, : : : , x2n), is
2)Case-Based
Reasoning (CBR)
- CBR: Uses a database of problem and
solutions to solve new problems
- Store symbolic description
(tuples or cases)—not points in a Euclidean space
- Applications:
Customer-service (product-related diagnosis), legal ruling
- Methodology
1) Instances
represented by rich symbolic descriptions (e.g., function graphs)
2) Search
for similar cases, multiple retrieved cases may be combined
3) Tight
coupling between case retrieval, knowledge-based reasoning, and problem solving
- Challenges
1) Find
a good similarity metric
2) Indexing
based on syntactic similarity measure,
and when failure, backtracking, and adapting to additional cases
Other Classification Methods
Genetic
Algorithms (GA)
Genetic
algorithms attempt to incorporate ideas of natural evolution. In general,
genetic learning starts as follows. An initial population is created consisting
of randomly generated rules. Each rule can be represented by a string of bits.
As a simple
example, suppose that samples in a given training set are described by two
Boolean attributes, A1 and A2, and that there are two classes,C1
andC2. The rule “IF A1 ANDNOT A2 THENC2” can be
encoded as the bit string “100,” where the two leftmost bits represent
attributes A1 and A2, respectively, and the rightmost bit
represents the class. Similarly, the rule “IF NOT A1 AND NOT A2
THENC1” can be encoded as “001.”
Based on the notion of survival
of the fittest, a new population is formed to consist of the fittest rules
in the current population, as well as offspring of these rules.
fitness of a rule is assessed by
its classification accuracy on a set of training samples.
Offspring are
created by applying genetic operators such as crossover and mutation. In
crossover, substrings from pairs of rules are swapped to form new pairs of
rules. In mutation, randomly selected bits in a rule’s string are inverted.
PREDICTION
What
Is Prediction?
n Prediction is different from classification
because:-
1) Classification refers to predict categorical class label(yes/no)
2) Prediction models continuous-valued functions(salary)
Major method for
prediction: Regression
It models the relationship between one or more independent or predictor
variables and a dependent or response variable.
Ø Linear
Regression
Straight-line regression analysis
involves a response variable, y, and a single predictor variable, x.
It is the simplest form of regression, and models y as a linear function
of x.
That is, y = b+wx;
where the variance of y is assumed to be constant, and b and w
are regression coefficients specifying the Y-intercept and slope of the
line, respectively.
>>>>>Refer
to the numerical solved in class…………………………………..>
Ø Nonlinear
Regression
1) Some nonlinear models can be modeled by a polynomial
function
2) A polynomial regression model can be transformed
into linear regression model. For
example,
y = w0
+ w1 x + w2 x2 + w3 x3
convertible to linear with
new variables: x2 = x2, x3= x3
y = w0 + w1 x +
w2 x2 + w3 x3
What are Classifier Accuracy Measures? Explain Confusion
Matrix?
1)
The accuracy of a classifier on a given test set is
the percentage of test set tuples that are correctly classified by the
classifier.
2)
The confusion matrix is a useful tool for
analyzing how well your classifier can recognize tuples of different classes.
Given two classes, we can talk in terms of positive tuples (tuples of the main
class of interest, e.g., buys computer = yes) versus negative tuples (e.g.,
buys computer = no).
3)
True positives refer to the positive tuples that
were correctly labeled by the classifier, while true negatives are the negative
tuples that were correctly labeled by the classifier.
4)
False positives are the negative tuples that were
incorrectly labeled (e.g., tuples of class
buys computer = no for which the
classifier predicted buys computer = yes
C1
|
C2
|
|
C1
|
True
positive
|
False negative
|
C2
|
False positive
|
True negative
|
Features of Confusion Matrix
1)
Accuracy of a classifier M, acc(M):
percentage of test set tuples that are correctly classified by the model M
1. Error
rate (misclassification rate) of M = 1 – acc(M)
2. Given
m classes, CMi,j, an entry in a confusion matrix,
indicates no. of tuples in class i
that are labeled by the classifier as class j
2)
Alternative
accuracy measures
a)
Sensitivity = t-pos/pos /* true positive recognition rate
*/
b)
Specificity = t-neg/neg /* true negative recognition rate
*/
c)
Precision = t-pos/(t-pos + f-pos)
d)
Accuracy = sensitivity * pos/(pos +
neg) + specificity * neg/(pos + neg)
This model can also be
used for cost-benefit analysis
What are the Predictor
Error Measures?
- Measure predictor accuracy: measure
how far off the predicted value is from the actual known value
- Loss function:
measures the error between. yi and the predicted value yi’
a) Absolute
error: | yi – yi’|
b) Squared
error: (yi – yi’)2
Q) How can we evaluate the Accuracy of
a Classifier or Predictor?
Ans) There are various
methods of evaluating accuracy of a classifier or predictor.
- Holdout
Method and Random Sub sampling
Ø
In
this method, the given data are randomly partitioned into two independent sets,
a training set and a test set. Typically, two-thirds of the data
are allocated to the training set, and the remaining one-third is allocated to
the test set. The training set is used to derive the model, whose accuracy is
estimated with the test set
Ø Random sub
sampling is a variation of the holdout method in which the holdout method is
repeated k times. The overall accuracy estimate is taken as the average
of the accuracies obtained from each iteration.
- Cross-validation
Ø
In
k-fold cross-validation, the initial data are randomly
partitioned into k mutually exclusive subsets or “folds,” D1, D2,
: : : , Dk, each of approximately equal size. Training and testing is
performed k times. In iteration i, partition Di is
reserved as the test set, and the remaining partitions are collectively used to
train the model. That is, in the first iteration, subsets D2, : : : , Dk
collectively serve as the training set in order to obtain a first model,
which is tested on D1; the second iteration is trained on subsets D1,
D3, : : : , Dk and tested on D2; and so on.
- Bootstrap
Ø
Unlike
the accuracy estimation methods mentioned above, the bootstrap method samples
the given training tuples uniformly with replacement. That is, each time
a tuple is selected, it is equally likely to be selected again and re added to
the training set. For instance, imagine a machine that randomly selects tuples
for our training set. In sampling with replacement, the machine is
allowed to select the same tuple more than once.
There are several bootstrap
methods. A commonly used one is the .632 bootstrap,
which
works as follows. Suppose we are given
a data set of d tuples. The data set is sampled d times,
with replacement, resulting in a bootstrap sample or training set of d
samples.
It is very likely that some of the original data tuples will occur more than
once in
this sample. The data tuples that did not make it into the training set end up
forming the test set
please provide notes on cluster analysis
ReplyDeleteWhere is Cluster Analysis ???
ReplyDelete