This is one of the posts from the PowerShellIT series, in which we take common (and sometimes not so common) use cases and try to simplify or automate them using PowerShell.
Today’s Use Case
Identify identical files (twins) in a folder. Not just files with the same metadata (file name, author, file size, timestamps), but truly identical files that have the same content.
Disclaimer: We will only find duplicate files and report the result to the user. What to do with the duplicates (remove, keep both, ignore) is entirely up to the user.
The idea is to compare file contents pretty quickly. We do not need to compare file names and timestamps just to identify two files with identical content. The fast and reliable way to identify file twins (with identical content) is to use hashing.
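A hash function is a one-way function that takes input of arbitrary length and generates a fixed-size output called a hash value. Identical inputs always produce the same hash value, and with a cryptographic hash it is extremely unlikely that two different inputs produce the same one.
Being a one-way function means that it is practically infeasible to invert.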
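To make the fixed-size property concrete, here is a minimal sketch that hashes two strings of very different lengths with the .NET SHA256 class. The helper name Get-StringSha256 is made up for this illustration and is not part of the FileTwin module.

```powershell
# Hypothetical helper: SHA-256 of a string, returned as lowercase hex.
function Get-StringSha256 {
    param([Parameter(Mandatory)][string]$Text)

    $sha = [System.Security.Cryptography.SHA256]::Create()
    try {
        $bytes = [System.Text.Encoding]::UTF8.GetBytes($Text)
        # Format each byte as two hex digits and join them into one string.
        -join ($sha.ComputeHash($bytes) | ForEach-Object { $_.ToString('x2') })
    }
    finally {
        $sha.Dispose()
    }
}

$short = Get-StringSha256 -Text 'abc'
$long  = Get-StringSha256 -Text ('abc' * 10000)

$short.Length                                # 64 hex characters (256 bits)
$long.Length                                 # 64 as well, despite the much longer input
$short -eq (Get-StringSha256 -Text 'abc')    # True: same input, same hash
$short -eq $long                             # False: different input, different hash
```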
And luckily for us, PowerShell has a built-in cmdlet: Get-FileHash. (“Oh how convenient”).
This cmdlet “computes the hash value for a file by using a specified hash algorithm”. The default hashing algorithm is SHA256, a member of the SHA-2 family that produces a 256-bit hash value.
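As a quick illustration of the cmdlet, the sketch below creates three throwaway files in a temporary folder (the folder and file names are invented for the demo) and compares their hashes. Two files hold the same text under different names, the third holds different text.

```powershell
# Build a scratch folder with two identical files and one different file.
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("filetwin-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir 'a.txt') -Value 'hello twins'
Set-Content -Path (Join-Path $dir 'b.txt') -Value 'hello twins'
Set-Content -Path (Join-Path $dir 'c.txt') -Value 'something else'

# Get-FileHash returns an object with Algorithm, Hash and Path properties.
$hashA = Get-FileHash -Path (Join-Path $dir 'a.txt')    # SHA256 by default
$hashB = Get-FileHash -Path (Join-Path $dir 'b.txt')
$hashC = Get-FileHash -Path (Join-Path $dir 'c.txt')

$hashA.Algorithm                # SHA256
$hashA.Hash -eq $hashB.Hash     # True: same content, different file names
$hashA.Hash -eq $hashC.Hash     # False: different content

# Clean up the scratch folder.
Remove-Item -Path $dir -Recurse -Force
```

A different algorithm can be selected with the -Algorithm parameter, but the default is sufficient for spotting twins.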
We will cover two variations of our use case:
- Find Duplicates within the folder
- Find duplicate for a specified file within the folder
We also have two requirements:
- Acceptable performance, as computing hash values for a large number of files takes compute resources;
- Once the duplicates are found, provide meaningful output that will not confuse users.
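One common way to keep the hashing cost down, offered here only as a sketch and not necessarily what the FileTwin module does, is to hash only files that share their size with at least one other file, since files of different sizes cannot have identical content. The function name Get-HashCandidate is hypothetical.

```powershell
# Hypothetical pre-filter: return only files whose size matches another file's size.
function Get-HashCandidate {
    param([Parameter(Mandatory)][string]$Path)

    Get-ChildItem -Path $Path -File |
        Group-Object -Property Length |      # bucket files by size in bytes
        Where-Object { $_.Count -gt 1 } |    # a unique size cannot have a twin
        ForEach-Object { $_.Group }          # unwrap the groups back into file objects
}
```

The files it returns can then be piped to Get-FileHash, so files with a unique size are never read at all.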
The solution is a simple and elegant PowerShell function that accepts the path to the folder where duplicates (twins) should be found. It also accepts an additional (optional) parameter: the path to a file that should be treated as the baseline, in which case the function identifies any duplicates of that file.
#Find Duplicates within folder
- Get list of files in the directory
- Generate hash values for each file
- Compare hashes and identify any duplicates

#Find Duplicate of a file within folder
- Generate hash value for a BaseFile
- Get list of files in the directory
- Generate hash values for each file
- Compare baseline hash to directory files' hash values and identify duplicates
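The two outlines above can be sketched in a few lines of PowerShell. This is an illustration of the steps, not the module's actual source: the function name Find-TwinSketch is made up, it only looks at the top level of the folder, and it does not exclude the baseline file if that file happens to live in the same folder.

```powershell
# Hypothetical sketch of both variations; not the FileTwin implementation.
function Find-TwinSketch {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)][string]$Path,   # folder to search
        [string]$File                          # optional baseline file
    )

    # Get the list of files in the directory and hash each one.
    $hashes = Get-ChildItem -Path $Path -File | Get-FileHash

    if ($File) {
        # Variation 2: compare every hash against the baseline file's hash.
        $baseline = (Get-FileHash -Path $File).Hash
        $hashes | Where-Object { $_.Hash -eq $baseline }
    }
    else {
        # Variation 1: any hash that occurs more than once marks a set of twins.
        $hashes |
            Group-Object -Property Hash |
            Where-Object { $_.Count -gt 1 } |
            ForEach-Object { $_.Group }
    }
}
```

Each returned object carries the Hash and Path properties from Get-FileHash, so twins can be recognized by their shared hash.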
The solution has been wrapped into a PowerShell module called FileTwin.
The module contains one function called Find-FileTwin.
It finds file duplicates in the specified folder, or looks for duplicates of the provided file within a specific location.
Identify duplicates in the directory.
Find-FileTwin -Path C:\Users\andys\Downloads\ -Verbose
Find duplicates of a specific file within a directory.
Find-FileTwin -Path C:\MigrationDestination\test2\Files\Downloads\ -File C:\MigrationDestination\test2\Files\laptop.png -Verbose
All of the source code is available in the PowerShellIT repository on GitHub.
Thanks a lot for reading.