Practical-1

Aim: Data preprocessing using the scikit-learn Python library

Theory:

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning

2. Data Transformation

3. Data Reduction

Importance of Data Preprocessing:


A couple goes to a medical clinic for a pregnancy test, and both the man and the woman take the test. When the results come back, they suggest that the man is pregnant. Pretty strange, isn't it?

Now relate this to a machine learning problem: classification. We have pregnancy test data for more than 1000 couples, and for 60% of the records we know who is pregnant. For the remaining 40% we have to predict the outcome from the previously recorded tests. Suppose that, within that 60%, 1% of the results suggest that the man is pregnant.

While building a machine learning model, if we have not done any preprocessing, such as correcting outliers, handling missing values, and normalizing and scaling the data, we may end up treating that 1% of clearly false results as valid.

A machine learning model is just code; a data scientist makes it smart by training it on previously recorded data. If we feed it garbage or false values, we get false predictions in return: the model will give wrong predictions for the 40% of people whose results are unknown.

 

There are standard tasks that you may use or explore during the data preparation step of a machine learning project. These tasks include:

 

·        Data Cleaning: Identifying and correcting mistakes or errors in the data. 

·        Feature Selection: Identifying those input variables that are most relevant to the task. 

·        Data Transforms: Changing the scale or distribution of variables. 

·        Feature Engineering: Deriving new variables from available data. 

·        Dimensionality Reduction: Creating compact projections of the data.


Fill Missing Values With Imputation

 

Real-world data often has missing values. There are many reasons for missing values in a dataset, such as observations that were not recorded and data corruption. Many machine learning algorithms do not support data with missing values, so handling missing data is very important.

 

Filling in missing values is called data imputation. A popular approach is to calculate a statistic for each column (such as the mean or median) and replace all missing values in that column with that statistic.

 

Dataset: The horse colic dataset

URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died. 

The dataset contains missing values marked with a question mark ('?'). We load the dataset with the read_csv() function and ensure that question mark values are parsed as NaN. Once loaded, we use scikit-learn's SimpleImputer class to replace every missing value (NaN) with the mean of its column.
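Below is a minimal sketch of this step, assuming the raw CSV file has no header row (as the file at the URL above does). For simplicity it imputes every column, including the outcome; in a real project you would exclude the target column first.

from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer

# load the dataset; the file has no header row and '?' marks a missing value
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
print('Missing cells before imputation: %d' % sum(isnan(data).flatten()))

# replace every NaN with the mean of its column
imputer = SimpleImputer(strategy='mean')
transformed = imputer.fit_transform(data)
print('Missing cells after imputation: %d' % sum(isnan(transformed).flatten()))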





Scaling Data With Normalization

The performance of many machine learning algorithms improves when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the inputs, such as linear regression, and algorithms that use distance measures, such as k-nearest neighbors.

 

One of the most common techniques for scaling numerical data prior to modeling is normalization. Normalization scales each input variable separately to the range 0-1, which is the range for floating-point values where we have the most precision. It requires that you know, or can accurately estimate, the minimum and maximum observable values for each variable. You can normalize your dataset using the scikit-learn MinMaxScaler class. The example below defines a synthetic classification dataset, then uses MinMaxScaler to normalize the input variables.
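Here is a minimal sketch of that example; the sample and feature counts passed to make_classification are arbitrary choices.

from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler

# define a synthetic classification dataset (sizes are arbitrary)
X, y = make_classification(n_samples=1000, n_features=5, random_state=1)
print('Before scaling: min=%.3f max=%.3f' % (X.min(), X.max()))

# scale each input variable separately to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print('After scaling:  min=%.3f max=%.3f' % (X_scaled.min(), X_scaled.max()))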




One Hot Encoding

Machine learning models require input and output attributes to be numeric. Categorical variables contain label values rather than numeric values, so if the data contains categorical variables, we must encode them in numerical form before we can fit and evaluate a model. One of the most common techniques for transforming categorical variables into numerical form is one hot encoding.

Each label of a categorical variable can be mapped to a unique integer, called an ordinal encoding. A one hot encoding can then be applied to the ordinal representation.

This is where one new binary variable is added to the dataset for each unique integer value in the variable, and the original categorical variable is removed from the dataset. For example, imagine we have a color variable with three categories (red, green, and blue). In this case, three binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.
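As a tiny illustration of the color example (the three labels are the ones above; scikit-learn orders the new binary columns alphabetically):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# the color variable from the example above, one sample per row
colors = np.array([['red'], ['green'], ['blue']])
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()
print(encoder.categories_)  # columns: blue, green, red
print(encoded)
# [[0. 0. 1.]    red
#  [0. 1. 0.]    green
#  [1. 0. 0.]]   blue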

This one hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class. The breast cancer dataset contains only categorical input variables. The example below loads the dataset and one hot encodes each of the categorical input variables.
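A minimal sketch follows. It assumes the breast cancer dataset referred to is the breast-cancer.csv file in the same GitHub repository as the horse colic dataset above, with the class label in the last column; adjust the URL if your copy lives elsewhere.

from pandas import read_csv
from sklearn.preprocessing import OneHotEncoder

# assumed location: same repository as the horse colic dataset above
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values

# split into categorical inputs and the output label
X, y = data[:, :-1].astype(str), data[:, -1].astype(str)
print('Input shape before encoding:', X.shape)

# one new binary column per unique label in each input variable
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)
print('Input shape after encoding:', X_encoded.shape)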





 

