This is the first part of the third post from the Data Manipulations with PowerShell series, where we dive into a specific phase of the data operations process and apply PowerShell along the way.
Previous posts:
- Empower your Data Retrieval with PowerShell
- Make your Data Processing more efficient with PowerShell
During this phase, data is converted and modified from one format, structure, or representation to another. It involves manipulating and reorganizing data to make it suitable for a particular analysis, system, or application.
Data transformation is often performed before data analysis, modeling, or loading data into a target system. It helps in improving data quality, resolving inconsistencies, and ensuring data compatibility.
In general, the most common types of data transformation are:
- Data Cleaning – handling missing data, dealing with outliers, correcting errors, and resolving inconsistencies in the data. Used to ensure that data is accurate and reliable for further analysis.
- Data Integration – combines and merges data from different formats, structures, or databases into a unified format. Usually helps to create a consolidated view of the data, enabling comprehensive analysis.
- Data Discretization – transforms continuous data into categorical or ordinal data by creating bins or intervals. Often used to simplify complex relationships or reduce the impact of outliers.
- Data Encoding – converts categorical data into numerical representations to make it suitable for certain algorithms or models. Common encoding techniques include one-hot encoding, label encoding, and ordinal encoding.
- Data Normalization – scales numerical data to a common range or distribution. Used to prevent certain variables from dominating the analysis due to their larger magnitude.
- Data Reduction – reduces the dimensionality of the dataset while preserving important information. Examples include principal component analysis (PCA) and feature selection methods.
- Data Conversion – converts data from one format to another, such as converting CSV files to XML or JSON (a quick sketch follows this list).
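As a quick taste of the conversion case, here is a minimal sketch using the built-in Import-Csv and ConvertTo-Json cmdlets (the file names are hypothetical):

```powershell
# Convert a CSV file to JSON (hypothetical file names)
Import-Csv -Path .\data.csv | ConvertTo-Json | Set-Content -Path .\data.json
```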
As part of our common procedure for this series, we will look into the out-of-the-box cmdlets as well as the modules and scripts authored by the awesome PowerShell community.
Data Cleaning
Data cleaning itself has a couple of items to deal with:
- Missing values
- Outliers
- Duplicates
- Inconsistency & errors
Out of the Box
In terms of the out-of-the-box cmdlets, most of them are pretty familiar to us at this point. We used them a bunch in the previous (and most likely future) posts from this series. As usual, our “one size fits all” cmdlets are Where-Object, Sort-Object, ForEach-Object and the “old but gold” regular expressions.
Let’s look into some examples of how to apply PowerShell to Data Cleaning use cases.
Missing values
To handle missing values, we first need to identify them in specific columns or rows. We can utilize Where-Object and Group-Object to achieve this.
```powershell
# Check if there are any missing values in the Label column
$DataCollection | Group-Object -Property Label
```
After that, we will need to replace missing data with some placeholder like “No value”, “N/A”, “Unknown”, 0 (if the column is supposed to contain only numbers), etc.
```powershell
$DataCollection | ForEach-Object {
    if ([string]::IsNullOrEmpty($_.Label)) {
        $_.Label = "No Label" # Replace missing values with 'No Label'
    }
}
```
Outliers
The usual first step is to identify outliers, and the same set of cmdlets as previously can be used: Group-Object, Where-Object.
```powershell
# Check if there are any outliers in the SensitiveInformationCount column
$DataCollection | Group-Object -Property SensitiveInformationCount
```
Once we have identified the outliers, we can remove them from our data set or transform them to mitigate their impact on the data.
```powershell
$threshold = 1000 # Set the threshold for outlier detection
$CleanedData = $DataCollection | Where-Object { [int]$_.SensitiveInformationCount -lt $threshold }
```
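Alternatively, instead of dropping the rows, we can transform the outliers by capping them at the threshold. A minimal sketch, assuming the same $DataCollection and $threshold as above:

```powershell
# Cap outliers at the threshold instead of removing them
$DataCollection | ForEach-Object {
    if ([int]$_.SensitiveInformationCount -gt $threshold) {
        $_.SensitiveInformationCount = $threshold
    }
}
```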

Duplicates
To tackle this use case, we need to make sure to select only unique values. This can be achieved with the Sort-Object cmdlet, specifically its -Unique parameter, as shown in the sketch below.
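A minimal sketch, assuming duplicates are rows where both the Label and SensitiveInformationCount values match:

```powershell
# Keep only one row per unique Label/SensitiveInformationCount combination
$DeduplicatedData = $DataCollection |
    Sort-Object -Property Label, SensitiveInformationCount -Unique
```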
Inconsistency & Errors
Usually, inconsistencies are related to formatting, like phone numbers, SSNs or credit card numbers with or without the dashes.
With PowerShell we can use the effortless String.Replace or time-tested and reliable regular expressions.
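For example, here is a minimal sketch that normalizes SSN formatting with the -replace operator (the SSN property name is an assumption for illustration):

```powershell
$DataCollection | ForEach-Object {
    # Strip everything but digits, then re-insert the dashes (SSN property is hypothetical)
    $digits = $_.SSN -replace '\D', ''
    $_.SSN = $digits -replace '^(\d{3})(\d{2})(\d{4})$', '$1-$2-$3'
}
```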
Note: The dataset used for the examples is dummy DLP data.
Community Provided
Number of modules with *clean*, *outlier*, *miss*, *duplicate*, *error* or *data cleaning* in the name – 906.
I would like to highlight a few of them:
- ColumnMapper by Eric Hanko – finds all unique values in the column specified by the KeysHeader parameter and converts them to hash keys.
- TM-DataManipulation by Taylor Marvin – contains classes and functions designed to handle ranges of integers, and perform string operations.
Data Integration
The main goal here is to create a consolidated view of the data and set the ground for comprehensive analysis and decision-making in future phases.
Out of the Box
PowerShell provides out-of-the-box mechanisms to merge and combine data, such as arrays, hash tables, or custom objects. The leading role here is taken by the Join-Object cmdlet, which can be used to merge data based on common keys or attributes. Although this cmdlet is not part of the built-in modules, the approach and proof of concept were presented back in 2012 (more than 10 years ago) by Lucio Silveira at the PowerShell Team Dev Blog.
It is supposed to work similarly to a SQL JOIN and link two collections based on some common denominator.
Here is one of the ways to do it using Where-Object:
```powershell
foreach ($r in $DataSet1) {
    $DOB = $DataSet2 |
        Where-Object { $_.SSN -eq $r.SSN } |
        Select-Object -ExpandProperty "Date of Birth (DOB)"
    $r | Add-Member -NotePropertyName "DOB" -NotePropertyValue $DOB
}
```
The drawback of this approach is that we are altering the input object by adding additional members to it, which might create more complexities going forward.
With Join-Object, a new object is created with the properties that you outline, leaving the original objects intact:
```powershell
Join-Object -Left $DataSet1 -Right $DataSet2 `
    -Where { $args[0].SSN -eq $args[1].SSN } `
    -Type AllInLeft `
    -LeftProperties "First and Last Name", "SSN", "Credit Card Number" `
    -RightProperties "Date of Birth (DOB)"
```
Note: The dataset used for the example is dummy DLP data.
Community Provided
Number of modules with *join* and *merge* in the name – 275.
A few of them are worth highlighting:
- Join & JoinModule by Ronald Bode (aka iRon7) – combines two object lists based on a related property between them.
- Join-Object by ili (aka ili101) – joins data from two sets of objects based on a common value (LINQ edition).
- Join-String by saw-friendship – joins strings from an array.
- Join-Strings by Joel Sallow – pipeline-ending cmdlet that will join all strings from the preceding pipeline.
- markdownmerge by @DonovanBrown (aka DarqueWarrior) – port of Markdown-include.
- MergeCsv by Joakim Borger Svendsen – merges CSV files and/or custom PowerShell objects based on one or more shared ID properties.
Data Discretization
The purpose of discretization is to transform continuous data into discrete intervals or categories. It also helps to facilitate pattern recognition.
There are three well-known discretization techniques:
- Equal Width Binning
- Equal Frequency Binning
- Custom Binning
Out of the Box
The typical built-in cmdlets that we will be utilizing are Measure-Object, Sort-Object, Where-Object and Add-Member. The majority of the aforementioned cmdlets are part of the PowerShell.Utility module. To read more about the PowerShell.Utility module, its cmdlets and use cases, go to Unleashing the Full Potential of the PowerShell.Utility Module.
Equal Width Binning
This technique is useful when the data distribution is uniform. It divides the range of values into a fixed number of intervals with equal width.
Let’s look at an example, where we have an “Age” column that we would like to split into 3 intervals of equal width.
```powershell
# Determine the number of intervals and the interval width
$numIntervals = 3
$minValue = $Dataset2 | Measure-Object -Property Age -Minimum | Select-Object -ExpandProperty Minimum
$maxValue = $Dataset2 | Measure-Object -Property Age -Maximum | Select-Object -ExpandProperty Maximum
$intervalWidth = ($maxValue - $minValue) / $numIntervals

# Discretize the data
foreach ($r in $Dataset2) {
    $interval = [math]::Floor(($r.Age - $minValue) / $intervalWidth)
    if ($interval -eq $numIntervals) { $interval = $numIntervals - 1 } # Clamp the maximum value into the last bin
    $discretizedValue = "$($minValue + ($interval * $intervalWidth)) - $($minValue + (($interval + 1) * $intervalWidth))"
    $r | Add-Member -NotePropertyName "DiscretizedAge" -NotePropertyValue $discretizedValue
}
```
Equal Frequency Binning
This technique divides the data into intervals of equal frequency, ensuring that each interval contains a similar number of data points.
In our example dataset we have a numeric column called “Salary” and we will be discretizing it into three equal-frequency intervals.
```powershell
$numIntervals = 3
$sortedData = $Dataset3 | Sort-Object -Property Salary
$chunkSize = [math]::Ceiling($sortedData.Count / $numIntervals)

$discretizedData = $sortedData | ForEach-Object -Begin { $index = 1; $interval = 1 } -Process {
    $_ | Add-Member -NotePropertyName "Discretized-Salary" -NotePropertyValue "Interval $interval"
    $index++
    if ($index -gt $chunkSize) {
        $index = 1
        $interval++
    }
    $_
}
```
Custom Binning
This one allows you to define custom intervals based on specific requirements or domain knowledge.
Let’s dive into the example.
```powershell
# Each interval is inclusive of Min and exclusive of Max
$intervals = @(
    @{ Min = 0;  Max = 14;  Label = "Children" },
    @{ Min = 14; Max = 25;  Label = "Youth" },
    @{ Min = 25; Max = 65;  Label = "Adults" },
    @{ Min = 65; Max = 150; Label = "Seniors" }
)

foreach ($r in $Dataset3) {
    $discretizedValue = $intervals |
        Where-Object { $r.Age -ge $_.Min -and $r.Age -lt $_.Max } |
        Select-Object -ExpandProperty Label
    $r | Add-Member -NotePropertyName "Discretized-Age" -NotePropertyValue $discretizedValue
}
```
Next: Data Transformation Made Easy with PowerShell – Part 2.
Data icons created by Freepik – Flaticon.
Thanks a lot for reading.
