Practical-2

Aim: Perform the following Data Pre-processing (Feature Selection/Elimination) tasks using Python

Theory:-

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features for use in model construction.

Why is it important (Advantages):-

    1. It enables the machine learning algorithm to train faster. 

    2. It reduces the complexity of a model and makes it easier to interpret. 

    3. It improves the accuracy of a model if the right subset is chosen. 

    4. It reduces overfitting.

    5. It is very efficient and fast to compute.

 

Disadvantages of Feature Selection:-


1. A feature that is not useful by itself can be very useful when combined with others; feature selection methods that score features individually will miss such interactions.

 

Various Data pre-processing techniques:-

DataSet: Diabetes dataset

Url to direct import data: https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv  

        The dataset is used to predict whether or not a patient has diabetes.
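        A minimal sketch of loading the dataset with pandas is shown below. The raw CSV has no header row, so the short column names used here (preg, plas, pres, skin, insu, mass, pedi, age, class) are our own labels for the standard Pima Indians attributes, not names taken from the file.

    import pandas as pd

    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
    # Column labels are our own assumption; the raw file carries no header row
    names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]

    df = pd.read_csv(url, names=names)
    X = df.drop(columns="class")   # the eight input attributes
    y = df["class"]                # target: 1 = diabetic, 0 = not diabetic
    print(df.shape)                # (768, 9)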


   

Data reduction using variance threshold:-

        It removes all features whose variance does not meet a given threshold. By default, it removes features with zero variance, i.e. features that have the same value in all samples.
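        A short sketch using scikit-learn's VarianceThreshold; the threshold value of 0.5 is an illustrative choice, not a recommendation.

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]
    X = pd.read_csv(url, names=names).drop(columns="class")

    # Default threshold is 0.0 (drop constant features); 0.5 is illustrative
    selector = VarianceThreshold(threshold=0.5)
    X_reduced = selector.fit_transform(X)

    print("Kept features:", list(X.columns[selector.get_support()]))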

 

Univariate feature selection:-

        Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method.
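        For example, SelectKBest with the chi-squared test scores each feature independently against the target and keeps the k best; k=4 below is an arbitrary illustrative choice.

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, chi2

    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]
    df = pd.read_csv(url, names=names)
    X, y = df.drop(columns="class"), df["class"]

    # chi2 requires non-negative feature values, which holds for this dataset
    selector = SelectKBest(score_func=chi2, k=4)   # keep the 4 highest-scoring features
    X_new = selector.fit_transform(X, y)

    print("Selected features:", list(X.columns[selector.get_support()]))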




Recursive feature elimination:-

        Given an external estimator that assigns weights to features, recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained. Then, the least important features are pruned from the current set, and the procedure is repeated until the desired number of features is eventually reached.
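        A sketch using scikit-learn's RFE with logistic regression as the external estimator; standardizing first only helps the solver converge, and the target of 3 features is illustrative.

    import pandas as pd
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]
    df = pd.read_csv(url, names=names)
    X, y = df.drop(columns="class"), df["class"]

    X_scaled = StandardScaler().fit_transform(X)   # helps the solver converge

    estimator = LogisticRegression()               # its coefficients act as importances
    rfe = RFE(estimator, n_features_to_select=3)   # prune down to 3 features
    rfe.fit(X_scaled, y)

    print("Selected features:", list(X.columns[rfe.support_]))
    print("Ranking (1 = selected):", dict(zip(X.columns, rfe.ranking_)))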




PCA:-

        Principal component analysis (PCA) is a rotation of the data from one coordinate system to another, chosen so that the new axes (the principal components) point in the directions of greatest variance. Because it is a geometric transformation, PCA should not be applied when the variables are not numeric quantities that can sensibly be placed in a common coordinate system.
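        A sketch that standardizes the features (a common choice here, since the attributes use very different scales) and projects them onto the first three principal components; keeping 3 components is an illustrative choice.

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]
    X = pd.read_csv(url, names=names).drop(columns="class")

    X_scaled = StandardScaler().fit_transform(X)   # put all attributes on one scale
    pca = PCA(n_components=3)                      # keep the first 3 components
    X_pca = pca.fit_transform(X_scaled)

    print("Explained variance ratio:", pca.explained_variance_ratio_)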



Correlation:-

        Correlation is a statistical term which in common usage refers to how close two variables are to having a linear relationship with each other. Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features have high correlation, we can drop one of the two features.
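        A sketch of this idea: compute the absolute correlation matrix, inspect its upper triangle so each pair is examined once, and drop one feature from every pair above a cutoff. The 0.8 cutoff is illustrative; on this particular dataset no pair may actually exceed it.

    import numpy as np
    import pandas as pd

    url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
    names = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "class"]
    X = pd.read_csv(url, names=names).drop(columns="class")

    corr = X.corr().abs()
    # Keep only the upper triangle so each feature pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]

    X_reduced = X.drop(columns=to_drop)
    print("Dropped features:", to_drop)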


