Practical-4

Aim: Visual programming with the Orange tool

Theory:

Raw data in a CSV file (or a pandas DataFrame) is generally hard to read when the goal is to extract insights, regardless of how well the data is structured or organized.

According to the SAS data visualization webpage, the human brain processes large amounts of complex data more easily from charts or graphs than from poring over spreadsheets or reports.

Visualization supports modeling at several stages, but it is most convenient in the EDA (Exploratory Data Analysis) phase, when the goal is to demonstrate or understand patterns in the data.

 

The Data Sampler widget is used to split the data in Orange.

Data Sampler

Inputs

    1. Data: Input dataset

Outputs

    1. Data Sample: Sampled data instances

    2. Remaining Data: out-of-sample data

The Data Sampler widget implements several data sampling methods. It outputs a sampled dataset and a complementary dataset containing the instances that were not sampled. The output is produced once the input dataset is provided and Sample Data is pressed.



  • Information on the input and output dataset.
  • The desired sampling method:
    • Cross Validation partitions data instances into the specified number of complementary subsets. Following a typical validation schema, all subsets except the one selected by the user are output as Data Sample, and the selected subset goes to Remaining Data. (Note: In older versions, the outputs were swapped. If the widget is loaded from an older workflow, it switches to compatibility mode.)
    • Fixed sample size returns a selected number of data instances, with the option to set Sample with replacement, which always samples from the entire dataset.
    • Bootstrap draws the sample from the input data with replacement, so that a population statistic can be inferred from it.
    • Fixed proportion of data returns a chosen percentage of the entire data (e.g. 70% of all the data).
  • Press Sample Data to output the data sample.
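
To make the idea of sampling with replacement more concrete, here is a minimal sketch in plain Python. It is not Orange's own code: the function name `bootstrap_sample` and the seed are made up for illustration, and it assumes the usual convention that a bootstrap sample has the same size as the input. Instances that are never drawn play the role of the Remaining Data output.

```python
import random

def bootstrap_sample(rows, seed=0):
    """Hypothetical helper: draw len(rows) instances with replacement.

    Returns (sample, remaining), where remaining holds the instances
    that were never drawn (the out-of-sample data).
    """
    rng = random.Random(seed)
    n = len(rows)
    drawn = [rng.randrange(n) for _ in range(n)]  # indices may repeat
    drawn_set = set(drawn)
    sample = [rows[i] for i in drawn]
    remaining = [rows[i] for i in range(n) if i not in drawn_set]
    return sample, remaining

rows = list(range(768))  # stand-in for 768 data instances
sample, remaining = bootstrap_sample(rows)
print(len(sample), len(set(sample)), len(remaining))
```

Because indices can repeat, the sample contains duplicates, and the number of distinct sampled instances plus the remaining instances adds up to the size of the input.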

Example: 

    Cross Validation:
        Models can be built and evaluated in different ways. A common procedure is cross validation. It splits the data into k folds and uses k - 1 folds for training and the remaining fold for testing. The procedure is repeated so that every fold is used for testing exactly once.
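
The k-fold procedure described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not how Orange implements it: the helper name `k_fold_splits` is hypothetical, and the folds are formed by shuffling indices and dealing them out in turn.

```python
import random

def k_fold_splits(n_instances, k, seed=0):
    """Hypothetical helper: yield (train, test) index lists for k folds."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [i for f, fold in enumerate(folds) if f != test_fold for i in fold]
        yield train, test

splits = list(k_fold_splits(768, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each of the k iterations trains on k - 1 folds and tests on the one left out, so every instance ends up in a test set exactly once.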

  • Now, we will use the Data Sampler to split the data into training and testing parts. We are using the Pima Diabetes dataset, which we loaded with the File widget.
  • In Data Sampler, we split the data with cross validation, using 10 subsets.
  • Then we connected Data Sampler -> Test and Score. Finally, we added Logistic Regression as a learner: Logistic Regression -> Test and Score.




Fixed Sample Size:

  • First, let’s see how the Data Sampler works. We will use the Pima Diabetes dataset from the File widget.
  • We see there are 768 instances in the data. We sampled the data with the Data Sampler widget.
  • We chose to go with a fixed sample size of 5 instances. 
  • We can observe the sampled data in the Data Table widget. 
  • The second Data Table (out of sample) shows the remaining 763 instances (768 - 5) that weren’t in the sample. To output the out-of-sample data, double-click the connection between the widgets and rewire the output to Remaining Data -> Data.
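
For readers who prefer code, the same split can be approximated outside Orange with a short Python sketch. The helper name `fixed_size_sample` is hypothetical, the list of integers merely stands in for the 768 data instances, and sampling is done without replacement.

```python
import random

def fixed_size_sample(rows, size, seed=0):
    """Hypothetical helper: pick `size` instances without replacement.

    Returns (sample, remaining); together they cover every instance once.
    """
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(rows)), size))
    sample = [row for i, row in enumerate(rows) if i in picked]
    remaining = [row for i, row in enumerate(rows) if i not in picked]
    return sample, remaining

rows = list(range(768))  # stand-in for the 768 instances
sample, remaining = fixed_size_sample(rows, size=5)
print(len(sample), len(remaining))
```

The two lists correspond to the Data Sample and Remaining Data outputs: no instance appears in both, and none is lost.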

Fixed Proportion: 

            A desired proportion can be set to bifurcate the data. For example, with the proportion set to 70%, the dataset is divided into two parts containing 70% and 30% of the instances respectively.
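
A 70/30 split of this kind can be sketched in a few lines of Python. The helper name `proportion_split` is hypothetical, and rounding the cut point to the nearest whole instance is an assumption of this sketch; Orange may round differently.

```python
import random

def proportion_split(rows, proportion, seed=0):
    """Hypothetical helper: shuffle, then cut at the given proportion."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the input order is left untouched
    rng.shuffle(shuffled)
    cut = round(len(rows) * proportion)  # assumed rounding rule
    return shuffled[:cut], shuffled[cut:]

rows = list(range(768))  # stand-in for the 768 instances
sample, rest = proportion_split(rows, 0.70)
print(len(sample), len(rest))
```

Shuffling before cutting keeps the split random, while the fixed seed makes it repeatable from run to run.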
