Data Cleaning In SQL

Ivan Zakharchuk
3 min readJun 7, 2021

--

What is Data Cleaning

Data cleaning is foundational skill to be a Data Scientist or Data Analyst. By definition, data cleaning is a process of cleaning up raw data to make it usable and ready for analysis. Following are most common cases when data cleaning need to be preformed.

  • All data lumped together in a single column and you need to parse it to extract necessary information.
  • Data could default to string data types, and you need to cast each column appropriately to run computations.
  • Data could have un-standardized units and you need to normalize the column to ensure equally comparing across records.

Methods for Data Cleaning

The following set of methods cover three different types of data cleaning: changing the data type of the information, parsing information, returning where information located.

  • Left: Extracts a number of characters from a string starting from the left
  • Right: Extracts a #of characters from a string starting from the right
  • Substr: Extracts a substring from a string (starting at any position)
  • Position: Returns the position of the first occurrence of a substring.
  • Strpos: Returns the position of a substring within a string
  • Concat: Adds two or more expressions together
  • Cast: Converts a value of any type into a specific, different data type
  • Coalesce: Returns the first non-null value in a list A handful of these functions, as you’ll quickly realize, are more commonly used than others.

Left, Right useful when a single column holds too much info from a raw data dump and needs to be parsed to make the data usable.

Substr method extracting substring from the string at any position. Use case the same as for previous methods, when column contain too much information and needs to be parsed. Second parameter should be starting position of substring and the 3rd parameter is how many elements to extract.

Concat function adds two or more expressions together. Common use when a unique identifier is split across multiple columns and the user has a need to combine them.

Cast function converts a value of any type into a specific, different data type. Useful when the raw data types are unsuitable for analyses. The most common use is when the raw data types all default to strings, and the user has to cast each column to the appropriate data type. Syntax for this function will be : CAST(column_name AS desire_datatype).

Position/Strpos Used to return the position of information to identify where relevant information is held in a string to then extract across all records

Coalesce: Used to return the first non-null value that’s commonly used for normalizing data that’s stretched across multiple columns and includes NULLs.

Conclusion

It is almost always need to do some type of cleaning on a dataset before actually been able to solve a problem. It is helpful to follow the next steps when starting off with data cleaning in SQL.

  1. Review the problem statement.
  2. What data do you have? What data do you need?
  3. How will you adjust existing data or create new columns?
  4. Leverage cleaning techniques to manipulate data.
  5. Leverage analysis techniques to determine the solution.

Additional readings and documentation for Data Cleaning can be find here.

--

--