Cleaning data in PostgreSQL

Cleaning string data. Capitalization of strings eg people's names, cities, countries. INITCAP(input_string) - fixing capitalization. To update the names of the whole column permanently use UPDATE eg UPDATE customers SET customer_name = INITCAP(customer_name); 2.Removing spaces in strings. REPLACE(input_string, to_replace, replacement) 3.Makes strings have uniform length. LPAD(input_string,length[,fill_value] SELECT LPAD(event_id, 10, '0') as event_id, -- Replace consecutive spaces with a single space REGEXP_REPLACE(INITCAP(parking_held), ' +', ' ','g') as parking_held FROM film_permit; 4.Pattern matching using the regular expressions (REs) - Basic of REs: ~ : Matches regex pattern (case-sensitive) ~* : Matches regex pattern (case-insensitive) !~ : Does not match regex pattern (case-sensitive) !~* : Does not match regex pattern (case-insensitive) Regex Metacharacters Metacharacter Meaning Example Query Returns \d Digit (0–9) SELECT '123' ~ '\\d\\d\\d'; true \w Word character (alphanumeric or underscore) SELECT 'abc_123' ~ '\\w+'; true \s Whitespace character SELECT ' ' ~ '\\s'; true . Any character except newline SELECT 'a9c' ~ 'a.c'; true ^ Anchors to the beginning of a string SELECT 'Hello' ~ '^H'; true $ Anchors to the end of a string SELECT 'Hello' ~ 'o$'; true * Matches 0 or more of the preceding character SELECT 'hoop' ~ 'hop'; true + Matches 1 or more of the preceding character SELECT 'hoop' ~ 'ho+p'; true ? Matches 0 or 1 of the preceding character SELECT 'color' ~ 'colou?r'; true [] Character class: match one of the characters SELECT 'a' ~ '[abc]'; true [^] Negated character class: not in the set SELECT 'x' ~ '[^0-9]'; true ` ` Alternation (OR) `SELECT 'dog' ~ 'cat {% raw %}() Grouping for subpatterns SELECT 'abab' ~ '(ab)+'; true Note: In SQL strings, backslashes (\) must be escaped, so you’ll often use double backslashes (\\d, \\s, etc.). Filtering Rows with Regex SELECT FROM users WHERE email ~* '^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$'; 5.Using REGEXP_REPLACE() used to replace patterns in strings. - - Its similar to REPLACE() function. Its general syntax is REGEXP_REPLACE(source,pattern,replace,flags). pattern- string pattern to match in source string. replace- replacement string to use in place of the pattern. flags- an optional string used to control matching. for instance REGEXP_REPLACE(cars,'\d','_','g') 6.Matching similar strings postgreSQl does not have function soundex first u run create extension. CREATE EXTENSION Fuzzystrmatch SOUNDEX(input_string) - 4 character code SELECT summons_number, vehicle_color FROM parking_violation WHERE -- Match SOUNDEX codes of vehicle_color and 'GRAY' DIFFERENCE(vehicle_color, 'GRAY') = 4; SELECT summons_number,vehicle_color, -- Include the DIFFERENCE() value for each color DIFFERENCE(vehicle_color, 'RED') AS "red", DIFFERENCE(vehicle_color, 'BLUE') AS "blue", DIFFERENCE(vehicle_color, 'YELLOW') AS "yellow" FROM parking_violation WHERE ( -- Condition records on DIFFERENCE() value of 4 DIFFERENCE(vehicle_color, 'RED') = 4 OR DIFFERENCE(vehicle_color, 'BLUE') = 4 OR DIFFERENCE(vehicle_color, 'YELLOW') = 4 ) SELECT summons_number, vehicle_color, DIFFERENCE(vehicle_color, 'RED') AS "red", DIFFERENCE(vehicle_color, 'BLUE') AS "blue", DIFFERENCE(vehicle_color, 'YELLOW') AS "yellow" FROM parking_violation WHERE ( DIFFERENCE(vehicle_color, 'RED') = 4 OR DIFFERENCE(vehicle_color, 'BLUE') = 4 OR DIFFERENCE(vehicle_color, 'YELLOW') = 4 -- Exclude records with 'BL' and 'BLA' vehicle colors ) AND vehicle_color NOT SIMILAR TO 'BLA?' 7.Handling missing data. Missing data can be caused by human error,systematic issues such as a failed replication process. Types of missing data. -Missing completely at random (MCAR) -Missing at Random (MAR) -Missing not at random(MNAR) Representations for missing values : Null (general) '' - empty string (used for string columns) Types of missing data 1️⃣ Missing Completely at Random (MCAR) The data is missing for no reason at all. No systematic relationship between missing data and other values. Best kind of missing data - safe to remove or ignore. 2️⃣ Missing at Random (MAR) For example - Older patient skipping recording their weight. Systematic relationship between missing data and other observed values. 3️⃣ Missing Not at Random (MNAR). The data is missing because of the missing value itself. for example; People with very high incomes don't want to share their income. Identifying missing data. Is Null is used to check whether a value is NULL. Null in databases indicates that data is either missing or not applicable which is different from an empty string or zero. ex

Apr 29, 2025 - 10:53

Cleaning string data.

Capitalization of strings eg people's names, cities, countries. INITCAP(input_string) - fixing capitalization. To update the names of the whole column permanently use UPDATE eg

UPDATE customers
SET customer_name = INITCAP(customer_name);

2.Removing spaces in strings.
REPLACE(input_string, to_replace, replacement)

3.Makes strings have uniform length. LPAD(input_string,length[,fill_value]

SELECT 
    LPAD(event_id, 10, '0') as event_id, 
    -- Replace consecutive spaces with a single space
    REGEXP_REPLACE(INITCAP(parking_held), ' +', ' ','g')  as 
    parking_held
    FROM 
    film_permit;

4.Pattern matching using the regular expressions (REs)
- Basic of REs:

~ : Matches regex pattern (case-sensitive)
~* : Matches regex pattern (case-insensitive)
!~ : Does not match regex pattern (case-sensitive)
!~* : Does not match regex pattern (case-insensitive)
Regex Metacharacters

Metacharacter	Meaning	Example Query	Returns
`\d`	Digit (0–9)	`SELECT '123' ~ '\\d\\d\\d';`	`true`
`\w`	Word character (alphanumeric or underscore)	`SELECT 'abc_123' ~ '\\w+';`	`true`
`\s`	Whitespace character	`SELECT ' ' ~ '\\s';`	`true`
`.`	Any character except newline	`SELECT 'a9c' ~ 'a.c';`	`true`
`^`	Anchors to the beginning of a string	`SELECT 'Hello' ~ '^H';`	`true`
`$`	Anchors to the end of a string	`SELECT 'Hello' ~ 'o$';`	`true`
`*`	Matches 0 or more of the preceding character	`SELECT 'hoop' ~ 'ho*p';`	`true`
`+`	Matches 1 or more of the preceding character	`SELECT 'hoop' ~ 'ho+p';`	`true`
`?`	Matches 0 or 1 of the preceding character	`SELECT 'color' ~ 'colou?r';`	`true`
`[]`	Character class: match one of the characters	`SELECT 'a' ~ '[abc]';`	`true`
`[^]`	Negated character class: not in the set	`SELECT 'x' ~ '[^0-9]';`	`true`
`	`	Alternation (OR)	`SELECT 'dog' ~ 'cat
{% raw %}`()`	Grouping for subpatterns	`SELECT 'abab' ~ '(ab)+';`	`true`

Note: In SQL strings, backslashes (\) must be escaped, so you’ll often use double backslashes (\\d, \\s, etc.).

Filtering Rows with Regex

SELECT * FROM users
WHERE email ~* '^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$';

5.Using REGEXP_REPLACE() used to replace patterns in strings. - - Its similar to REPLACE() function.

Its general syntax is

REGEXP_REPLACE(source,pattern,replace,flags).

pattern- string pattern to match in source string.
replace- replacement string to use in place of the pattern.
flags- an optional string used to control matching. for instance

REGEXP_REPLACE(cars,'\d','_','g')

6.Matching similar strings
postgreSQl does not have function soundex first
u run create extension.
CREATE EXTENSION Fuzzystrmatch

     SOUNDEX(input_string)  -  4 character code

SELECT summons_number, vehicle_color
FROM parking_violation
WHERE
  -- Match SOUNDEX codes of vehicle_color and 'GRAY'
DIFFERENCE(vehicle_color, 'GRAY') = 4;

SELECT summons_number,vehicle_color,
    -- Include the DIFFERENCE() value for each color
    DIFFERENCE(vehicle_color, 'RED') AS "red",
    DIFFERENCE(vehicle_color, 'BLUE') AS "blue",
    DIFFERENCE(vehicle_color, 'YELLOW') AS "yellow"
FROM
    parking_violation
WHERE 
    (
        -- Condition records on DIFFERENCE() value of 4
        DIFFERENCE(vehicle_color, 'RED') = 4 OR
        DIFFERENCE(vehicle_color, 'BLUE') = 4 OR
        DIFFERENCE(vehicle_color, 'YELLOW') = 4
    )

SELECT 
    summons_number,
    vehicle_color,
    DIFFERENCE(vehicle_color, 'RED') AS "red",
    DIFFERENCE(vehicle_color, 'BLUE') AS "blue",
    DIFFERENCE(vehicle_color, 'YELLOW') AS "yellow"
FROM
    parking_violation
WHERE
    (
        DIFFERENCE(vehicle_color, 'RED') = 4 OR
        DIFFERENCE(vehicle_color, 'BLUE') = 4 OR
        DIFFERENCE(vehicle_color, 'YELLOW') = 4
    -- Exclude records with 'BL' and 'BLA' vehicle colors
    ) AND vehicle_color NOT SIMILAR TO 'BLA?'

7.Handling missing data.

Missing data can be caused by human error,systematic issues such as a failed replication process. Types of missing data. -Missing completely at random (MCAR) -Missing at Random (MAR) -Missing not at random(MNAR)

Representations for missing values :

Null (general)
'' - empty string (used for string columns)

Types of missing data

1️⃣ Missing Completely at Random (MCAR)

The data is missing for no reason at all.
No systematic relationship between missing data and other values.
Best kind of missing data - safe to remove or ignore.

2️⃣ Missing at Random (MAR)

For example - Older patient skipping recording their weight.
Systematic relationship between missing data and other observed values.

3️⃣ Missing Not at Random (MNAR).

The data is missing because of the missing value itself.
for example; People with very high incomes don't want to share their income.

Identifying missing data.

Is Null is used to check whether a value is NULL. Null in databases indicates that data is either missing or not applicable which is different from an empty string or zero. example

  select title from film 
  where release_year is null;

select inspection_type,count(*) as count
from restaurant_ispection
where score is null
group by inspection_type
order by count desc;

Rectifying missing data.

Best option: locate and add missing values , may not be feasible, may not be worthwhile.
Provide a value.
Exclude records.

The coalesce() function is designed to evaluate a list of arguments and return the first non-null argument it encounters.

It enhances the readability and robustness of SQL queries.

Features

It has unlimited arguments.
Returns first non null value evaluates each argument in the order they are provided and returns the first one that is not null.
Returns Null if all are null
It stops evaluating if it finds a non null value.
Coalesce can be employed in JOIN conditions to handle nullable columns ensuring that null values do not disrupt the join logic.

replacing null values with unknown

UPDATE countries
SET
--replace null countries values with 'Unknown'
country=Coalesce(national_park,'Unknown');
Select Count(*) From countries

Handling duplicated data.

Duplicates wastes storage resources.
Duplicates distorts analysis.

NOTE
Duplicated values should be addressed with intimate domain knowledge.

The Row_number() function determines where groups of columns values are duplicated in a dataset.An OVER clause must follow the ROW_NUMBER()function call.The OVER clause defines the window(or set of results over which the function will be applied.

ROW_NUMBER() OVER( PARTITION BY 
    plate_id, issue_date,violation_time,house_number,street_name
      ) - 1 AS duplicate

8.Detecting invalid values.

Data may need cleaning when table columns contains values that are not valid for the attribute that the column represents. ### Handling invalid data with pattern matching Using

where score NOT SIMILAR TO '\d+';

No restriction on length of value
Query only restricts non-digits character

Using not similar to get the registration_state not in UPPER CASE and not in 2 characters

SELECT Summons_number,plate_id,registration_state
FROM parking_violation
WHERE registration_state NOT SIMILAR TO '[A-Z{2}']

9.Determining column types.

Select column_name, data_type
from information_schema.columns
Where
table_name = 'restaurant_inspection' AND 
column_name = 'camis';

BIGINT is a data type used to store very large whole numbers safely, much larger than the regular INT type.

INT can handle about ±2 billion (±2^31)

BIGINT can handle about ±9 quintillion (±2^63)

SELECT MAX(summons_number) - MIN(summons_number) AS range_size
FROM parking_violation;

10.Parsing timestamps with TO_TIMESTAMP()
The timestamp syntax is

TO_TIMESTAMP(ts_string,format_string) > TIMESTAMP

TO_TIMESTAMP(inspection_datetime,'YYYY-MM-DD HH24:MI'),

T0_TIMESTAMP(ts_string,format)-
TO_CHAR(ts_value,format) - timestamp - to string

11.Combining columns.
-Concatenation involves joining individual values end-to-end to create a single combined value.

CONCAT(string[,string2,string3,....])

Select CONCAT ( name,E'\n',
building, ' ', street, E'\n',
boro, ', NY', Zip_code ) AS
mailing_address
FROM
restaurant_inspection;

|| double pipe operator, like CONCAT() joins the values that surround the double pipe.
string1 || string2 [ || string3 || ...]

select 'data' || ' ' || 'cleaning' || ' ' || 'is' || ' '|| 'fun';
11.Splitting columns.
-STRPOS(source_string, search_string).

It returns the integer representing the first position containing search string in source string.
For instance to find the position of first '-' we use

  SELECT 
     STRPOS(house_number,'-') AS dash_position
  FROM
     parking_violation;

-SUBSTRING(source_string FROM start_pos For num_chars)

SELECT
  house_number,
  SUBSTRING(
    house_number
    FROM STRPOS(house_number, '-') + 1
    FOR LENGTH(house_number) - STRPOS(house_number, '-')
  ) AS new_house_number
FROM
parking_violation;

12.Splitting strings using SPLIT_PART()

SPLIT_PART(source_string,delimiter_string,part_number) - The SPLIT_PART() function requires 3 arguments: a source string, the delimiter on which to split and the index of the component to return from the split components.

 SELECT 
      SPLIT_PART ('Cycle Inspection / Re-inspection', ' / ', 2 );

REGEXP_SPLIT_TO_TABLE(source,pattern) - the REGEXP_SPLIT_TO_TABLE() function splits string values in a source column using a regular expression pattern. For instance

SELECT REGEXP_SPLIT_TO_TABLE('Pizza/Italian', '/' );