When is it necessary to split a dataset for analysis? Is it before or after we clean the data? That is the question.

**Data Cleaning**
What is data?
First and foremost, let us find out what Data is and is not.

Data, in its simplest form, is raw, unprocessed facts and figures. It can be numbers, text, images, or any other form of information that can be stored and processed by computers. Data becomes meaningful information when it is analyzed, interpreted, and placed in context. Therefore, data is the basic unit of information before it has been organized, analyzed, or interpreted.

What is information?
Information, on the other hand, is the result of taking raw data and transforming it into a meaningful, usable format for analysis. The process may involve interpreting, organizing, and contextualizing data.

In data science, splitting your dataset effectively is an important initial step towards building a robust model. Generally, you'll want to allocate a larger portion of your data for training, often around 70-80%, with the remaining 20-30% for testing. This allows the model to learn from a substantial amount of data while still retaining enough unique data points to test its predictions.
However, this split may depend on the data size and its diversity. For smaller datasets, you may need to use techniques like cross-validation to maximize the use of your data for training while still getting a reliable estimate of model performance.
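For illustration, here is a minimal sketch of a 70/30 split using scikit-learn's train_test_split, with cross-validation as the fallback for smaller datasets; the dataset itself is synthetic, generated only for this example.

```
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# For smaller datasets, k-fold cross-validation on the training set
# gives a more reliable estimate of model performance
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print(scores.mean())
```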

Types of Data:
Data can be broadly categorized as qualitative (descriptive, non-numerical) or quantitative (numerical, measurable).

In Machine Learning, Data Analysis, and Data Science, it is generally recommended that you split the dataset before you start cleaning and pre-processing it. This helps prevent data leakage, where information from the test set influences the training set.

For example, if you scale data before splitting, the scaling parameters (such as the mean and standard deviation) are calculated from the entire dataset, including the test set, which can compromise the model's ability to generalize to unseen data.

Explanation:

1. Avoiding Data Leakage:
Splitting before cleaning ensures that the training set is independent of the test set. This prevents information from the test set from being "visible" to the model during training, which could lead to overfitting and poor performance on new, unseen data. (Overfitting occurs when a model learns the training data too well, including its noise and outliers; essentially, the model memorizes the training data instead of learning the underlying patterns, leading to inaccurate predictions on new data.)

2. Global Pre-processing:
Some cleaning and pre-processing steps, like handling missing values or feature engineering, are sometimes done globally across the entire dataset. When such steps do not use statistics learned from the data, performing them before splitting avoids inconsistencies between the training and test sets.

3. Local Pre-processing:
Other pre-processing steps, like scaling, are done locally, meaning they are fitted on the training set and then applied separately to the test set. (Data scaling is the process of transforming numerical values to a common range or distribution, such as 0-1 or a mean of 0 and a standard deviation of 1, often done to improve the performance of machine learning models.) These steps should be performed after splitting to avoid data leakage.

4. Why Split Before?
Splitting before cleaning and pre-processing allows you to use the training set to learn the necessary transformations and then apply those same transformations to the test set. This ensures that the model is evaluated on data that it has not "seen" during training.
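As a minimal sketch of this idea, assuming scikit-learn's StandardScaler and a toy one-column dataset, the transformation is learned from the training set and merely reused on the test set:

```
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # a toy single-feature dataset
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std computed from training data only
X_test_scaled = scaler.transform(X_test)        # same parameters reused; no leakage
```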

Now, let us explore the possibilities, and what would be considered the best option and practice.

Data pre-processing involves cleaning and transforming raw data into a usable format for analysis, improving the accuracy and efficiency of the resulting machine learning models. It addresses issues like missing values, inconsistencies, and outliers in the data, preparing it for subsequent tasks like machine learning and model training.

1. For what Purpose?
a) Improve Data Quality: Addressing inaccuracies, inconsistencies, and errors in the data.

b) Enhance Model Performance: Preparing data for machine learning algorithms makes it easier for the algorithm to understand and learn.

c) Streamline Analysis: Ensuring data is in a format suitable for analysis and visualization.

2. Key Steps:
a) Data Cleaning:
i) Handling Missing Values: Imputing or removing
missing data points.
ii) Identifying and Correcting Errors: Addressing
inconsistencies, outliers, and other data quality
issues.
iii) Removing Duplicates: Ensuring each record is
unique.

b) Data Transformation:
i) Feature Scaling: Normalizing or standardizing
numerical features to a common scale.
ii) One-Hot Encoding: Converting categorical data into
numerical representations.
iii) Data Transformation: Applying functions to modify
the values of features, e.g., taking logarithms or
square roots (see the short sketch after this list).

c) Feature Engineering:
i) Creating new features from existing ones to
help improve model performance.
ii) Data Integration: Combining data from multiple
sources into a single dataset for easy manipulation
and analysis.
iii) Data Reduction: Reducing the dimensionality of the
data to improve model efficiency.
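As referenced above, here is a short sketch of a log transformation with numpy and pandas; the 'income' column is a made-up example, and log1p computes log(1 + x) so zero values are handled gracefully.

  ```
  import numpy as np
  import pandas as pd

  dataframe = pd.DataFrame({'income': [20000, 45000, 120000, 1000000]})

  # Compress the long right tail with log(1 + x)
  dataframe['income_log'] = np.log1p(dataframe['income'])
  print(dataframe)
  ```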

3. Examples:
a) Filling Missing Values:
Replacing missing values with the mean, median, or a
predicted value based on other features.
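A minimal sketch with pandas, using a hypothetical 'age' column:

  ```
  import numpy as np
  import pandas as pd

  dataframe = pd.DataFrame({'age': [25, np.nan, 40, 35, np.nan]})

  # Replace missing ages with the column median
  dataframe['age'] = dataframe['age'].fillna(dataframe['age'].median())
  print(dataframe)
  ```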

b) Removing Outliers:
Identifying and removing data points that are
significantly different from the rest of the data.
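One common technique is the interquartile range (IQR) rule; here is a minimal sketch on synthetic data, where 300 plays the outlier:

  ```
  import pandas as pd

  dataframe = pd.DataFrame({'value': [10, 12, 11, 13, 12, 300]})

  q1 = dataframe['value'].quantile(0.25)
  q3 = dataframe['value'].quantile(0.75)
  iqr = q3 - q1

  # Keep only rows within 1.5 * IQR of the quartiles
  mask = dataframe['value'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
  dataframe = dataframe[mask]
  print(dataframe)
  ```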

c) Scaling Data:
Transforming numerical features to a common scale (e.g.,
0-1 or -1-1) using techniques like Min-Max Scaling or
Standardization.

  ```
  # Scaling technique: Min-Max Scaling
  # import the needed packages and libraries
  import pandas as pd
  from sklearn.preprocessing import MinMaxScaler

  # A sample DataFrame
  data = {'column1': [100, 200, 300, 400, 500],
          'column2': [101, 202, 303, 404, 505]}
  dataframe = pd.DataFrame(data)

  # Initialize the MinMaxScaler
  scaler = MinMaxScaler()

  # Fit and transform the desired dataset columns
  dataframe[['column1', 'column2']] = scaler.fit_transform(
      dataframe[['column1', 'column2']])

  print(dataframe)
  ```

d) Encoding Categorical Data:
Converting categorical features into numerical
representations, for example, using label encoding or
one-hot encoding; both are sketched below.

  ```
  # import the needed packages and libraries
  from sklearn.preprocessing import LabelEncoder
  import pandas as pd

  # generate a DataFrame
  data = {'Gender': ['Male', 'Female', 'Female', 'Male']}
  dataframe = pd.DataFrame(data)

  # label-encode the Gender column into integer codes
  gender = LabelEncoder()
  dataframe['Gender_Encoded'] = gender.fit_transform(dataframe['Gender'])

  print(dataframe)
  ```
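For true one-hot encoding, which creates one binary column per category, a minimal sketch with pandas.get_dummies:

  ```
  import pandas as pd

  dataframe = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

  # One binary indicator column per category value
  encoded = pd.get_dummies(dataframe, columns=['Gender'])
  print(encoded)
  ```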

4. Why do we even care?
a) Improved Model Accuracy: Pre-processing can
significantly improve the accuracy of machine learning
models.

b) Enhanced Model Performance: Pre-processing can make
models faster and more efficient.

c) Better Interpretability: Cleaned and transformed data is
easier to understand and interpret.

When is the best time to do Feature Selection?

In a dataset, a feature is a measurable property or characteristic of the data points. It's also known as a variable or attribute, representing a definable quality that can vary within the dataset. Features can be used to describe and understand the data, and they are often used as inputs to machine learning models.

Key aspects of features:
a) Measurable properties: Features are quantifiable
characteristics, like age, height, or temperature.
b) Variables: Their values can change from one data point
to another.
c) Attributes: These describe the data points in a dataset.
d) Inputs to models: In machine learning, features are
often used as inputs to train and predict outcomes.

Examples of Features:

a) In a medical dataset:
Features could include patient age, gender, blood pressure, cholesterol levels, etc.

b) In a weather dataset:
Features could include temperature, humidity, wind speed, cloud coverage, etc.

c) In a student performance dataset:
Features could include student attendance, grades, age, GPA, etc.

d) In a dataset of employee records:
Features could include age, location, salary, title, performance metrics, etc., according to IBM.

Feature selection is an important step in machine learning which involves selecting a subset of relevant features from the original feature set, reducing the feature space while improving the model's performance and lowering computational cost. It's a critical step in machine learning, especially when dealing with high-dimensional data.

When should feature selection be done?

One approach is to perform feature selection during the model training process: feature selection is integrated into training, allowing the model to dynamically select the most relevant features, according to GeeksforGeeks.

Feature selection helps by improving the model's accuracy, rather than leaving it to guess based on all features, and by increasing interpretability.

The best answer is to do feature selection after splitting the data; otherwise, if it is done before the split, information could leak from the test set.

Moreover, if the selected features change from one run or split to another, no generalization about feature importance can be made, which is not desirable.

A common objection is that if only the training set is used for feature selection, the test set may contain data points that defy or contradict the selection made on the training set, since the overall historical data is not analyzed.

Even so, performing feature selection on the entire dataset is what is generally not recommended, because it can lead to an overly optimistic model and potentially poor generalization on unseen data. Feature selection should ideally be performed on the training set only, to prevent information leakage and maintain an unbiased evaluation of the model's performance.
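A minimal sketch of leakage-free feature selection, assuming scikit-learn's SelectKBest as the selector (any selector fitted on the training set alone would do) and a synthetic dataset:

```
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the selector on the training set only...
selector = SelectKBest(score_func=f_classif, k=4).fit(X_train, y_train)

# ...then apply the same selection to both sets
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)
```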

In machine learning, feature importance scores are used to determine the relative importance of each feature in a dataset when building a predictive model. These scores are calculated using a variety of techniques, such as decision trees, random forests, linear models, and neural networks.

How do you evaluate feature importance?
The feature importance is calculated by measuring the increase or decrease in error when we randomly permute the values of a feature. If permuting the values causes a large change in the error, the feature is important to the model; otherwise, it is not.
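A minimal sketch using scikit-learn's permutation_importance on a synthetic regression dataset (the model choice here is arbitrary):

```
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the score drop;
# a larger drop means a more important feature
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for i, importance in enumerate(result.importances_mean):
    print(f"feature_{i}: {importance:.4f}")
```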