Outlier detection and handling in R

Apr 29, 2025 - 09:46

An outlier is a data point that significantly differs from other observations in a dataset. It can be:

  1. Unusually high or low compared to the rest of the data.
  2. Anomalous because of measurement errors, data entry mistakes, or rare events.
  3. A true extreme value that represents natural variation.

Example data set

The loaded data frame is named Data.
[Image: the example data set]
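The data set itself appears only as an image, so the snippets in this post cannot be run as they stand. A minimal stand-in, assuming a data frame with numeric Age and Net_worth columns, could look like the sketch below. All values are made up for illustration and are not the author's data; they are chosen only so that row 10 holds an extreme Age and row 12 an extreme Net_worth.

```r
# Hypothetical stand-in for the loaded data (values invented for illustration).
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 11000,
                14000, 16000, 22000, 13000, 17000, 152000)
)

str(Data)   # column types and a preview of the values
head(Data)  # first six rows
```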

How to identify outliers

  1. Basic summary function

summary(Data)

Output:

[Image: output of summary(Data)]

  2. Visual methods (using box plots)

Plot Age on a box plot:

boxplot(Data$Age, main = "Age", col = "skyblue")

Output:

[Image: box plot of Age]

Plot Net_worth on a box plot:

boxplot(Data$Net_worth, main = "Networth in *10000 PLN", col = "orange")

Output:

[Image: box plot of Net_worth]
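Reading outliers off a plot by eye can be imprecise. As a sketch, base R's boxplot.stats() returns the numbers a box plot is drawn from, including the points it would mark individually. The age vector here is made up for illustration.

```r
# Made-up ages for illustration; 93 is deliberately extreme.
age <- c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40)

# boxplot.stats() computes the box plot statistics without drawing anything.
bp <- boxplot.stats(age)

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # values that would be drawn as individual points
```

boxplot.stats() works with hinges rather than the default quartiles from quantile(), so on small samples its fences can differ slightly from a hand-computed 1.5 * IQR rule.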

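  3. Using the interquartile range (IQR)

A value counts as an outlier here when it lies more than 1.5 * IQR below the first quartile (Q1) or above the third quartile (Q3).

Identify the outliers in the Age values:

# Quartiles and interquartile range of Age
Q1 <- quantile(Data$Age, 0.25)
Q3 <- quantile(Data$Age, 0.75)
iqr_age <- Q3 - Q1  # named iqr_age so it does not mask stats::IQR()
# Bounds: anything beyond 1.5 * IQR from the quartiles is flagged
lower_bound_age <- Q1 - 1.5 * iqr_age
upper_bound_age <- Q3 + 1.5 * iqr_age
outlier_age <- Data$Age[Data$Age < lower_bound_age | Data$Age > upper_bound_age]
print(outlier_age)

Output:
93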

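Identify the outliers in the Net_worth values:

# Quartiles and interquartile range of Net_worth
Q1 <- quantile(Data$Net_worth, 0.25)
Q3 <- quantile(Data$Net_worth, 0.75)
iqr_net_worth <- Q3 - Q1  # avoids masking stats::IQR()
# Bounds: anything beyond 1.5 * IQR from the quartiles is flagged
lower_bound_Net_worth <- Q1 - 1.5 * iqr_net_worth
upper_bound_Net_worth <- Q3 + 1.5 * iqr_net_worth
outlier_networth <- Data$Net_worth[Data$Net_worth < lower_bound_Net_worth | Data$Net_worth > upper_bound_Net_worth]
print(outlier_networth)

Output:
152000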
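The same steps are repeated for each column, so they can be wrapped in a small function. The sketch below is one way to do that; the helper name is_iqr_outlier and the sample data are hypothetical, not part of the original post.

```r
# Hypothetical helper: TRUE where x lies outside
# [Q1 - k * IQR, Q3 + k * IQR].
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  x < q[1] - k * spread | x > q[2] + k * spread
}

# Made-up data for illustration.
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 11000,
                14000, 16000, 22000, 13000, 17000, 152000)
)

Data$Age[is_iqr_outlier(Data$Age)]              # flagged Age values
Data$Net_worth[is_iqr_outlier(Data$Net_worth)]  # flagged Net_worth values
which(is_iqr_outlier(Data$Age))                 # row positions of flagged ages
```

Returning a logical vector rather than the values keeps the result usable for both listing the outliers and filtering rows.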

How to handle outliers

  1. Dropping the outliers using the interquartile range

# Keep only the rows where both Net_worth and Age fall inside their bounds
new_data <- Data[
  Data$Net_worth >= lower_bound_Net_worth & Data$Net_worth <= upper_bound_Net_worth &
  Data$Age >= lower_bound_age & Data$Age <= upper_bound_age,
]

summary(new_data)

Output:

[Image: output of summary(new_data)]
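After filtering, it helps to confirm how many rows were dropped and which ones. The following is a sketch on made-up data; the helper inside_fences is a hypothetical name introduced here.

```r
# Made-up data for illustration.
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 11000,
                14000, 16000, 22000, 13000, 17000, 152000)
)

# Hypothetical helper: TRUE where x lies inside the 1.5 * IQR bounds.
inside_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x >= q[1] - fence & x <= q[2] + fence
}

# A row is kept only if both columns are inside their bounds.
keep <- inside_fences(Data$Age) & inside_fences(Data$Net_worth)
new_data <- Data[keep, ]

nrow(Data) - nrow(new_data)  # number of rows removed
Data[!keep, ]                # the rows that were removed
```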

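  2. Substituting the outliers with the column mean

Identify the row index of each outlier:

# Find the rows that hold the outlying values
which(Data$Net_worth == 152000)
which(Data$Age == 93)

Output:
12 (Net_worth), 10 (Age)

Replace the outliers with the column means. Note that each mean is computed while the outlier is still in the column, so the replacement value is pulled toward it.

# Replace the outlying data points with the column mean
Data$Net_worth[12] <- mean(Data$Net_worth)
Data$Age[10] <- mean(Data$Age)
summary(Data)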
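Hard-coded row numbers and values only fit one particular data set. A possible variant, sketched below on made-up data, flags the outliers with a logical mask and replaces them with the mean of the remaining values. Leaving the flagged values out of the mean is a choice made in this sketch, not the method used above.

```r
# Made-up data for illustration.
Data <- data.frame(
  Age = c(25, 31, 28, 35, 42, 38, 29, 33, 45, 93, 36, 40),
  Net_worth = c(12000, 15000, 9000, 20000, 18000, 11000,
                14000, 16000, 22000, 13000, 17000, 152000)
)

for (col in c("Age", "Net_worth")) {
  x <- Data[[col]]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  out <- x < q[1] - fence | x > q[2] + fence  # TRUE for flagged values
  # Replace flagged values with the mean of the unflagged ones.
  Data[[col]][out] <- mean(x[!out])
}

summary(Data)
```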

Plot the updated columns on box plots:

boxplot(Data$Age,
        main = "Age",
        col = "green",
        border = "blue")

Output:

[Image: box plot of Age after replacement]

boxplot(Data$Net_worth,
        main = "Networth in *10000 PLN",
        col = "yellow",
        border = "blue")

Output:

[Image: box plot of Net_worth after replacement]