Outlier detection and handling in R

An outlier is a data point that significantly differs from other observations in a dataset. It can be:
- Unusually high or low compared to the rest of the data.
- Anomalous due to measurement errors, data entry mistakes, or rare events.
- A true extreme value that represents natural variation.
Example data set (the loaded data is named Data)
How to identify outliers
- Basic summary function
summary(Data)
output
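As a rough illustration of what to look for in the summary, the sketch below builds a small toy data frame and compares the gap between the maximum and the third quartile with the width of the middle half of the data. The data frame `Demo` and its values are hypothetical and are not the data set used in this post; only the column names are borrowed from it.

```r
# Hypothetical toy data, NOT the data set used in this post
Demo <- data.frame(
  Age       = c(23, 25, 27, 29, 31, 33, 35, 37, 39, 93),
  Net_worth = c(12, 15, 18, 20, 22, 24, 26, 28, 30, 150)
)

# Six-number summary of one column
age_summary <- summary(Demo$Age)
print(age_summary)

# Distance from the 3rd quartile to the maximum, and the width of the box
gap_above_q3 <- age_summary[["Max."]] - age_summary[["3rd Qu."]]
box_width    <- age_summary[["3rd Qu."]] - age_summary[["1st Qu."]]

# A maximum that sits many box widths above the 3rd quartile deserves a closer look
print(gap_above_q3 / box_width)
```

A large ratio is only a hint, not a test; the box plot and interquartile range methods give a clearer answer.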
- Visual methods (using Box plot)
Plot age on a box plot
boxplot(Data$Age, main = "Age", col = "skyblue")
output
Plot Net_worth on a box plot
boxplot(Data$Net_worth, main = "Networth in *10000 PLN", col = "orange")
output
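The points drawn beyond the whiskers can also be read off as numbers rather than by eye. The sketch below uses `boxplot.stats()`, the function that computes the box plot statistics, on a hypothetical toy data frame `Demo` (not the data set used in this post).

```r
# Hypothetical toy data, NOT the data set used in this post
Demo <- data.frame(
  Age       = c(23, 25, 27, 29, 31, 33, 35, 37, 39, 93),
  Net_worth = c(12, 15, 18, 20, 22, 24, 26, 28, 30, 150)
)

# $out holds the values that lie beyond the whiskers (default coef = 1.5)
age_flagged      <- boxplot.stats(Demo$Age)$out
networth_flagged <- boxplot.stats(Demo$Net_worth)$out

print(age_flagged)
print(networth_flagged)
```

Note that `boxplot.stats()` builds the box from hinges rather than from `quantile()`, so on small samples the flagged values may differ slightly from those of the quantile-based rule.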
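- Using the interquartile range
Identify the outliers in the Age values
# First and third quartiles of Age
Q1 <- quantile(Data$Age, 0.25)
Q3 <- quantile(Data$Age, 0.75)
# Interquartile range (named iqr_age so it does not mask the built-in IQR() function)
iqr_age <- Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles are treated as outliers
lower_bound_age <- Q1 - 1.5 * iqr_age
upper_bound_age <- Q3 + 1.5 * iqr_age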
outlier_age <- Data$Age[Data$Age < lower_bound_age | Data$Age > upper_bound_age]
print(outlier_age)
output
93
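Identify the outliers in the Net_worth values
# First and third quartiles of Net_worth
Q1 <- quantile(Data$Net_worth, 0.25)
Q3 <- quantile(Data$Net_worth, 0.75)
# Interquartile range (named iqr_networth so it does not mask the built-in IQR() function)
iqr_networth <- Q3 - Q1
# Values more than 1.5 * IQR beyond the quartiles are treated as outliers
lower_bound_Net_worth <- Q1 - 1.5 * iqr_networth
upper_bound_Net_worth <- Q3 + 1.5 * iqr_networth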
outlier_networth <- Data$Net_worth[Data$Net_worth < lower_bound_Net_worth | Data$Net_worth > upper_bound_Net_worth]
print(outlier_networth)
output
152000
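Since the same steps are repeated for each column, they could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `iqr_outliers`, its `k` argument, and the toy data frame `Demo` are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: return the values of x that fall outside
# [Q1 - k * IQR, Q3 + k * IQR]
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  x[!is.na(x) & (x < lower | x > upper)]
}

# Hypothetical toy data, NOT the data set used in this post
Demo <- data.frame(
  Age       = c(23, 25, 27, 29, 31, 33, 35, 37, 39, 93),
  Net_worth = c(12, 15, 18, 20, 22, 24, 26, 28, 30, 150)
)

age_out      <- iqr_outliers(Demo$Age)
networth_out <- iqr_outliers(Demo$Net_worth)
print(age_out)
print(networth_out)
```

With a helper like this, the multiplier `k` can be changed in one place if a stricter or looser rule is wanted.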
How to handle outliers
- Dropping the outliers using the interquartile range
new_data <- Data[
  Data$Net_worth >= lower_bound_Net_worth & Data$Net_worth <= upper_bound_Net_worth &
    Data$Age >= lower_bound_age & Data$Age <= upper_bound_age,
]
summary(new_data)
output
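To see how many rows such a filter removes, the sketch below builds one logical mask per column and combines them. The function name `within_fences` and the toy data frame `Demo` are hypothetical, and the toy values are not the data used in this post.

```r
# Hypothetical helper: TRUE where x lies inside [Q1 - k * IQR, Q3 + k * IQR]
within_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  !is.na(x) & x >= q[1] - k * spread & x <= q[2] + k * spread
}

# Hypothetical toy data with one extreme value in each column (rows 10 and 3)
Demo <- data.frame(
  Age       = c(23, 25, 27, 29, 31, 33, 35, 37, 39, 93),
  Net_worth = c(12, 15, 150, 20, 22, 24, 26, 28, 30, 18)
)

# Keep a row only if it is inside the fences for every checked column
keep    <- within_fences(Demo$Age) & within_fences(Demo$Net_worth)
Trimmed <- Demo[keep, ]

rows_removed <- nrow(Demo) - nrow(Trimmed)
print(rows_removed)
```

A row is dropped as soon as one of its columns is flagged, so the number of removed rows can grow quickly when many columns are checked.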
- Substituting the outliers with the column mean
Identify the row index of each outlier
# Find the rows that hold the outlier values
which(Data$Net_worth == 152000)
which(Data$Age == 93)
output
12, 10
Replace the outliers with the means
# Replace the data points with the column mean
# Note: each mean is computed while the outlier is still in the column
Data$Net_worth[12] <- mean(Data$Net_worth)
Data$Age[10] <- mean(Data$Age)
summary(Data)
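The replacement above relies on hard-coded row numbers and on a mean that still includes the extreme value. As a variation, and not the method used in this post, the sketch below flags outliers with the interquartile range rule and replaces them with the mean of the remaining values. The function name `replace_outliers_with_mean` and the toy data frame `Demo` are hypothetical.

```r
# Hypothetical helper: replace IQR-rule outliers with the mean of the
# values that were NOT flagged
replace_outliers_with_mean <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  flagged <- !is.na(x) & (x < q[1] - k * spread | x > q[2] + k * spread)
  x[flagged] <- mean(x[!flagged], na.rm = TRUE)
  x
}

# Hypothetical toy data, NOT the data set used in this post
Demo <- data.frame(
  Age       = c(23, 25, 27, 29, 31, 33, 35, 37, 39, 93),
  Net_worth = c(12, 15, 18, 20, 22, 24, 26, 28, 30, 150)
)

Demo$Age       <- replace_outliers_with_mean(Demo$Age)
Demo$Net_worth <- replace_outliers_with_mean(Demo$Net_worth)
print(Demo[10, ])
```

Excluding the flagged values from the mean keeps the replacement from being pulled toward the very value it is meant to correct.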
Plot the new data columns on a box plot
boxplot(Data$Age,
        main = "Age",
        col = "green",
        border = "blue")
output
boxplot(Data$Net_worth,
        main = "Networth in *10000 PLN",
        col = "yellow",
        border = "blue")
output