Practical-1
Aim: Data Preprocessing using the scikit-learn Python library
Theory:
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning: handling missing values, noisy data, and outliers.
2. Data Transformation: converting data into a suitable form, for example through normalization or encoding.
3. Data Reduction: reducing the volume of data while preserving its analytical value.
Importance of Data Preprocessing:
A couple goes to a medical clinic for a pregnancy test, and both the man and the woman undergo the test. When the results come back, they suggest that the man is pregnant. Strange, isn't it?
Now relate this to a machine learning problem: classification. We have pregnancy test data for 1000+ couples, and for 60% of the records we know who is pregnant. For the remaining 40% we have to predict the outcome based on the previously recorded tests. Suppose that, out of the known 60%, 1% of the results suggest that the man is pregnant.
While building a machine learning model, if we have not done any preprocessing, such as correcting outliers, handling missing values, and normalizing and scaling the data, we may end up training on that 1% of results that are false.
A machine learning model is just some code; a data scientist makes it smart by training it on previously recorded data. So if we feed it garbage or false values, we get false predictions in return: the model will give wrong predictions for the 40% of people whose results are unknown.
There are common, standard tasks that you may use or explore during the data preparation step in a machine learning project. These tasks include:
· Data Cleaning: Identifying and correcting mistakes or errors in the data.
· Feature Selection: Identifying those input variables that are most relevant to the task.
· Data Transforms: Changing the scale or distribution of variables.
· Feature Engineering: Deriving new variables from available data.
· Dimensionality Reduction: Creating compact projections of the data.
Fill Missing Values With Imputation:
Real-world data often has missing values. There are many reasons for missing values in data, such as observations that were not recorded and data corruption. Many machine learning algorithms do not support data with missing values, so handling missing data is very important.
Filling in missing values is called data imputation, and one popular approach is to calculate a statistical value for each attribute (such as the mean or median) and replace all missing values in that column with that statistic.
Dataset: The horse colic dataset
URL: https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv
The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died. The dataset contains missing values marked with a question mark ('?'). We load the dataset with the read_csv() function and ensure that question mark values are parsed as NaN. Once loaded, we use scikit-learn's SimpleImputer class to replace every missing value (NaN) with the mean of its column, as in the sketch below.
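A minimal sketch of this imputation step, assuming the file has no header row (the URL is the one listed above):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
# The file has no header row; '?' marks missing values, so parse them as NaN.
df = pd.read_csv(url, header=None, na_values='?')
print('Missing values before imputation:', df.isnull().sum().sum())

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(strategy='mean')
data = imputer.fit_transform(df)
print('Missing values after imputation:', np.isnan(data).sum())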
Scaling Data With Normalization
The performance of many machine learning algorithms improves when numerical input variables are scaled to a standard range. This includes algorithms that use a weighted sum of the inputs, such as linear regression, and algorithms that use distance measures, such as k-nearest neighbors. Normalization rescales each variable to the range [0, 1] and is available in scikit-learn via the MinMaxScaler class.
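A minimal sketch of normalization, using a small made-up array:

from sklearn.preprocessing import MinMaxScaler

# A small illustrative dataset with two variables on very different scales.
X = [[100, 0.001],
     [8, 0.05],
     [50, 0.005],
     [88, 0.07],
     [4, 0.1]]

# Rescale each column to the range [0, 1].
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)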
One Hot Encoding
Machine learning models require input and output attributes in numeric form. This means that if the data contains categorical variables, we must encode them numerically before we can fit and evaluate a model. Categorical variables contain label values rather than numeric values, and one of the most common techniques for transforming them into numerical form is one hot encoding.
Each label of a categorical variable can first be mapped to a unique integer, called an ordinal encoding. A one hot encoding can then be applied to the ordinal representation.
This is where one new binary variable is added to the dataset for each unique integer value in the variable, and the original categorical variable is removed from the dataset. For example, imagine we have a color variable with three categories (red, green, and blue). In this case, three binary variables are needed, and a “1” is placed in the binary variable for the observed color with “0” values for the other colors.
This one hot
encoding transform is available in the scikit-learn Python machine learning
library via the OneHotEncoder class. The breast cancer dataset contains only
categorical input variables. The example below loads the dataset and one hot
encodes each of the categorical input variables.
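A sketch of that example follows; the dataset URL is assumed to be in the same repository as the horse colic dataset above, and the last column is assumed to be the class label.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Assumed URL: same repository as the horse colic dataset above.
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv'
df = pd.read_csv(url, header=None)

# Inputs are all columns but the last; the last column is the class label.
X = df.iloc[:, :-1].astype(str)
y = df.iloc[:, -1].astype(str)

# Replace each categorical input column with one binary column per unique label.
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X)
print('Columns before:', X.shape[1], 'after one hot encoding:', X_encoded.shape[1])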