Notes 4 Learners: DWDM Unit-3

Unit-3

Classification and Prediction

What Is Classification? What Is Prediction?

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large. Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous valued functions.

Example:-

· A bank loans officer needs analysis of her data in order to learn which loan applicants are “safe” and which are “risky” for the bank. Here the data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as “safe” or “risky” for the loan application data.

· Suppose that the marketing manager would like to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function,or ordered value,

What are steps in Data Classification?

Ø Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

Ø The set of tuples used for model construction is training set

Ø The model is represented as classification rules, decision trees, or mathematical formulae

Data classification is a two-step process

(a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute and the learned model or classifier is represented in the form of classification rules.

(b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

To Estimate accuracy of the model we do the following-

Ø The known label of test sample is compared with the classified result from the model

Ø Accuracy rate is the percentage of test set samples that are correctly classified by the model

Ø Test set is independent of training set, otherwise over-fitting will occur

Ø If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Process (1): Model Construction

Process (2): Using the Model in Prediction

Supervised vs. Unsupervised Learning

Supervised learning (classification)

1) Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

2) New data is classified based on the training set

Unsupervised learning (clustering)

1) The class labels of training data is unknown

2) Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data

Issues Regarding Classification and Prediction

Ø Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.

Ø Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used to identify whether any two given attributes are statistically related.

Ø Data transformation and reduction: The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values for a given attribute so that they fall within a small specified range, such as 1:0 to 1:0, or 0:0 to 1:0. In methods that use distance measurements, for example, this would prevent attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller ranges (such as binary attributes).

Comparing Classification and Prediction Methods

Ø Accuracy: The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data.

Ø Speed: This refers to the computational costs involved in generating and using the given classifier or predictor.

Ø Robustness: This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.

Ø Scalability: This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.

Ø Interpretability: This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.

What are the types of Classification techniques ?

1) Classification by Decision Tree Induction

Decision tree induction is the learning of decision trees from class-labeled training tuples.A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node.

Decision Tree Induction: Training Dataset

Output: A Decision Tree for “buys_computer”

How to select the root?

Attribute Selection Measure:

Ø Information Gain (ID3/C4.5)

Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i_{, D}|/|D|
Expected information (entropy) needed to classify a tuple in D:

4 . Information needed (after using A to split D into v partitions) to classify D:

5. Information gained by branching on attribute A

From the above example:-total D=14 TUPLES

(9-buys computer-yes) ( 5-buys computer-no)

means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s.

THE ATTRIBUTE WITH MAXIMUN GAIN IS MADE THE ROOT WHICH IS AGE IN THIS CASE.

Gain Ratio for Attribute Selection (C4.5)

n Information gain measure is biased towards attributes with a large number of values

n
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)

n GainRatio(A) = Gain(A)/SplitInfo(A)

n The attribute with the maximum gain ratio is selected as the splitting attribute

2) Rule-Based Classification

Using IF-THEN Rules for Classification

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

IF condition THEN conclusion.

An example is rule R1,

R1: IF age = youth AND student = yes THEN buys computer = yes.

The “IF”-part (or left-hand side)of a rule is known as the rule antecedent or precondition. The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth, and student = yes) that are logically ANDed. The rule’s consequent contains a class prediction (in this case,we are predicting whether a customer will buy a computer). R1 can also be written as-

R1: (age = youth) ^ (student = yes))->(buys computer = yes).

A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class labeled data set,D, let ncovers be the number of tuples covered by R; ncorrect be the number of tuples correctly classified by R; and jDj be the number of tuples in D.

We can define the coverage and accuracy of R as

coverage(R) =ncovers/D

Accuracy(R) =ncorrect/D

Rule Extraction from a Decision Tree

To extract rules from a decision tree, one rule is created for each path from the root to a leaf node. Each splitting criterion along a given path is logically ANDed to form the rule antecedent (“IF” part). The leaf node holds the class prediction, forming the rule consequent (“THEN” part).

Algorithm: Sequential covering. Learn a set of IF-THEN rules for classification.

Input:

D, a data set class-labeled tuples;

Att vals, the set of all attributes and their possible values.

Output: A set of IF-THEN rules.

Method:

(1) Rule set = fg; // initial set of rules learned is empty

(2) for each class c do

(3) repeat

(4) Rule = Learn One Rule(D, Att vals, c);

(5) remove tuples covered by Rule from D;

(6) until terminating condition;

(7) Rule set = Rule set +Rule; // add new rule to rule set

(8) endfor

(9) return Rule Set;

Examples of set D are

3) Classification by Back propagation

Back propagation is a neural network learning algorithm. Roughly speaking, a neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Neural network learning is also referred to as connectionist learning due to the connections between units
Advantages of neural networks, however, include their high tolerance of noisy data as well as their ability to classify patterns on which they have not been trained. They can be used when you may have little knowledge of the relationships between attributes and classes.
The back propagation algorithm performs learning on a multilayer feed-forward neural network. It iteratively learns a set of weights for prediction of the class label of tuples. A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an output layer.
Each layer is made up of units. The inputs to the network correspond to the attributes measured for each training tuple. The inputs are fed simultaneously into the units making up the input layer. These inputs pass through the input layer and are then weighted and fed simultaneously to a second layer of “neuronlike” units, known as a hidden layer. The outputs of the hidden layer units can be input to another hidden layer, and so on.
The number of hidden layers is arbitrary, although in practice, usually only one is used. The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network’s prediction for given tuples.The units in the input layer are called input units. The units in the hidden layers and output layer are sometimes referred to as neurodes, due to their symbolic biological basis,or as output units.

STEPS:-

Defining Network Topology

Before training can begin, the user must decide on the network topology by specifying the number of units in the input layer, the number of hidden layers(if more than one), the number of units in each hidden layer, and the number of units in the output layer.

Backpropagation

Back propagation learns by iteratively processing a set of training samples, comparing the network’s predicition for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network’s prediction and the actual class. These modifications are made in the “backwards” direction, that is , form the output layer through each hidden layer down to the first hidden layer (hence the name backpropagation).

Initialize the weights. The weights in the network are initialized to small random number(e.g., ranging from -1.0 to1.0,or -0.5 to 0.5).

Propagate the inputs forward

In this step,the net input and output of each unit in the hidden and output layers are computed.

Q) WHAT ARE LAZY AND EAGER LEARNERS?

Lazy vs. Eager Learning

n Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple

n Eager learning (the above discussed methods): Given a set of training set, constructs a classification model before receiving new (e.g., test) data to classify

n Lazy: more time in predicting

n Accuracy: More in lazy learners.

1)The k-Nearest Neighbor Algorithm

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the unknown tuple.

“Closeness” is defined in terms of a distance metric, such as Euclidean distance.

The Euclidean distance between two points or tuples, say, X1 = (x11, x12, : : : , x1n) and

X2 = (x21, x22, : : : , x2n), is

2)Case-Based Reasoning (CBR)

CBR: Uses a database of problem and solutions to solve new problems
Store symbolic description (tuples or cases)—not points in a Euclidean space
Applications: Customer-service (product-related diagnosis), legal ruling
Methodology

1) Instances represented by rich symbolic descriptions (e.g., function graphs)

2) Search for similar cases, multiple retrieved cases may be combined

3) Tight coupling between case retrieval, knowledge-based reasoning, and problem solving

Challenges

1) Find a good similarity metric

2) Indexing based on syntactic similarity measure, and when failure, backtracking, and adapting to additional cases

Other Classification Methods

Genetic Algorithms (GA)

Genetic algorithms attempt to incorporate ideas of natural evolution. In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits.

As a simple example, suppose that samples in a given training set are described by two Boolean attributes, A1 and A2, and that there are two classes,C1 andC2. The rule “IF A1 ANDNOT A2 THENC2” can be encoded as the bit string “100,” where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2 THENC1” can be encoded as “001.”

Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules.

fitness of a rule is assessed by its classification accuracy on a set of training samples.

Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule’s string are inverted.

PREDICTION

What Is Prediction?

n Prediction is different from classification because:-

1) Classification refers to predict categorical class label(yes/no)

2) Prediction models continuous-valued functions(salary)

Major method for prediction: Regression

It models the relationship between one or more independent or predictor variables and a dependent or response variable.

Ø Linear Regression

Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression, and models y as a linear function of x.

That is, y = b+wx; where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and slope of the line, respectively.

>>>>>Refer to the numerical solved in class…………………………………..>

Ø Nonlinear Regression

1) Some nonlinear models can be modeled by a polynomial function

2) A polynomial regression model can be transformed into linear regression model. For example,

y = w₀ + w₁ x + w₂ x²+ w₃ x³

convertible to linear with new variables: x₂= x², x₃= x³

y = w₀ + w₁ x + w₂ x₂+ w₃ x₃

What are Classifier Accuracy Measures? Explain Confusion Matrix?

1) The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly classified by the classifier.

2) The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of different classes. Given two classes, we can talk in terms of positive tuples (tuples of the main class of interest, e.g., buys computer = yes) versus negative tuples (e.g., buys computer = no).

3) True positives refer to the positive tuples that were correctly labeled by the classifier, while true negatives are the negative tuples that were correctly labeled by the classifier.

4) False positives are the negative tuples that were incorrectly labeled (e.g., tuples of class

buys computer = no for which the classifier predicted buys computer = yes

	C₁	C₂
C₁	True positive	False negative
C₂	False positive	True negative

Features of Confusion Matrix

1) Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M

1. Error rate (misclassification rate) of M = 1 – acc(M)

2. Given m classes, CM_i,j, an entry in a confusion matrix, indicates no. of tuples in class i that are labeled by the classifier as class j

2) Alternative accuracy measures

a) Sensitivity = t-pos/pos /* true positive recognition rate */

b) Specificity = t-neg/neg /* true negative recognition rate */

c) Precision = t-pos/(t-pos + f-pos)

d) Accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)

This model can also be used for cost-benefit analysis

What are the Predictor Error Measures?

Measure predictor accuracy: measure how far off the predicted value is from the actual known value
Loss function: measures the error between. y_i and the predicted value y_i’

a) Absolute error: | y_i – y_i’|

b) Squared error: (y_i – y_i’)²

Q) How can we evaluate the Accuracy of a Classifier or Predictor?

Ans) There are various methods of evaluating accuracy of a classifier or predictor.

Holdout Method and Random Sub sampling

Ø In this method, the given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set

Ø Random sub sampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.

Cross-validation

Ø In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, : : : , Dk, each of approximately equal size. Training and testing is performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively used to train the model. That is, in the first iteration, subsets D2, : : : , Dk collectively serve as the training set in order to obtain a first model, which is tested on D1; the second iteration is trained on subsets D1, D3, : : : , Dk and tested on D2; and so on.

Bootstrap

Ø Unlike the accuracy estimation methods mentioned above, the bootstrap method samples the given training tuples uniformly with replacement. That is, each time a tuple is selected, it is equally likely to be selected again and re added to the training set. For instance, imagine a machine that randomly selects tuples for our training set. In sampling with replacement, the machine is allowed to select the same tuple more than once.

There are several bootstrap methods. A commonly used one is the .632 bootstrap, which works as follows. Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a bootstrap sample or training set of d samples. It is very likely that some of the original data tuples will occur more than once in this sample. The data tuples that did not make it into the training set end up forming the test set

Notes 4 Learners

DWDM Unit-3

2 comments:

FEEDBACK

Blog Archive