Thursday, January 5, 2012

PowerShell (v2) - Splitting Up Very Large File Sets into Subfolders

In my most recent project, I needed to analyze a collection of files. I realized I was in trouble when I opened the folder in Explorer and it took 3 minutes to start displaying files. I navigated up a level, opened the folder's properties, and found I was dealing with more than 474 thousand files. Unless something changed, this was going to be unmanageable, so I decided to restructure the data by dividing it into folders of 10,000 files each. I had done something like this before, but not recently, so I had to reinvent some code. It turned out to be fairly easy. Thanks once again to the folks over on the TechNet forums for helping me come up with a good solution to this problem. I posted several threads there while trying to get my head around formulating a plan.
I'll explore the various issues I dealt with in those threads individually. For this post, I primarily wanted to share the script I came up with to handle the excessively large file collection. Without further rambling:
cls

Start-Transcript -Path (Join-Path ($scriptpath = "C:\Documents and Settings\will\My Documents\scripts") -ChildPath "transcript.txt");

#region variables

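$path = "C:\Documents and Settings\will\My Documents\data";
# Maximum number of files per subfolder (the post above describes batches of 10,000; adjust as needed).
# $filecount is computed once, in the script body, counting files only.
$maxfilecount = 1000;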

#endregion variables

#region functions

# Define Write-DateTime function to alias
function Write-DateTime {
    # "HH" gives a 24-hour clock; "hh" would log ambiguous 12-hour times.
    return (Get-Date).ToString("yyyyMMdd HH:mm:ss");
}

# Set alias for quick reference
Set-Alias -Name wdt -Value "Write-DateTime";

#endregion functions

#region Script Body

# Output status to host.
Write-Output "$(wdt): Gathering file names.";

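# Get file count (files only). The @() wrapper guarantees an array, so .Count
# is also valid in PowerShell v2 when zero or one file is returned.
$filecount = @(Get-ChildItem $path | Where-Object {$_.PSIsContainer -ne $true}).Count;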

# Output status to host.
Write-Output "$(wdt): Processing files.";

# Enumerate folders based on filecount and maxfilecount parameter.
# Round up so that a final, partially filled folder is created for any remainder.
for($i = 1; $i -le [math]::Ceiling($filecount/$maxfilecount); $i++) {

    # Clear the $files objects.
    $files = $null;

    # Set the foldername to zero-filled by joining the $path and counter ($i)
    $foldername = Join-Path -Path ($path) -ChildPath ("{0:00000}" -f $i);

    # Output status to host.
    Write-Output "$(wdt): Creating folder $foldername.";

    # Create new folder based on loop counter
    New-Item -Path $foldername -ItemType Directory | Out-Null;

    # Output status to host.
    Write-Output "$(wdt): Gathering files for $foldername.";

    # Break files into smaller collections to move to subfolders.
    $files = Get-ChildItem $path | Where-Object {$_.PSIsContainer -ne $true} | select -first $maxfilecount;

    # Output status to host.
    Write-Output "$(wdt): Moving files to $foldername.";

    # Enumerate collection.
    foreach($file in $files) {
        # Output status to host.
        Write-Output "$(wdt): Moving $($file.fullname).";

        # Move files in collection to subfolder.
        Move-Item -Path $file.fullname -Destination $foldername;
    }
}

# End logging.
Stop-Transcript

#endregion Script Body
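
The script above re-reads the directory once per subfolder, which gets expensive with hundreds of thousands of files. As a rough sketch of an alternative (not what I actually ran), the subfolder for each file can be worked out from its position in a single listing. The function names Get-BatchCount, Get-BatchFolderName and Move-FilesIntoBatches are hypothetical and made up for this illustration, and the numbers in the examples are invented:

```powershell
# Hypothetical helper: number of subfolders needed, rounded up so that
# a partially filled last batch still gets its own folder.
function Get-BatchCount {
    param([int]$FileCount, [int]$BatchSize)
    return [int][math]::Ceiling($FileCount / $BatchSize)
}

# Hypothetical helper: zero-filled subfolder name for the file at a
# given zero-based position in the listing. Folders are numbered from 1.
function Get-BatchFolderName {
    param([int]$Index, [int]$BatchSize)
    $batch = [int][math]::Floor($Index / $BatchSize) + 1
    return ("{0:00000}" -f $batch)
}

# Hypothetical single-pass mover built on the helpers above.
# Defined here but not invoked; call it with your own path and batch size.
function Move-FilesIntoBatches {
    param([string]$Path, [int]$BatchSize)

    # Snapshot the file list once, before any subfolders are created.
    $files = @(Get-ChildItem $Path | Where-Object { -not $_.PSIsContainer })

    for ($i = 0; $i -lt $files.Count; $i++) {
        $name   = Get-BatchFolderName -Index $i -BatchSize $BatchSize
        $folder = Join-Path -Path $Path -ChildPath $name

        # Create the subfolder the first time a file is assigned to it.
        if (-not (Test-Path $folder)) {
            New-Item -Path $folder -ItemType Directory | Out-Null
        }
        Move-Item -LiteralPath $files[$i].FullName -Destination $folder
    }
}

# Made-up numbers: 2,500 files in batches of 1,000.
Get-BatchCount      -FileCount 2500 -BatchSize 1000   # 3
Get-BatchFolderName -Index 0        -BatchSize 1000   # 00001
Get-BatchFolderName -Index 999      -BatchSize 1000   # 00001
Get-BatchFolderName -Index 1000     -BatchSize 1000   # 00002
Get-BatchFolderName -Index 2499     -BatchSize 1000   # 00003
```

The trade-off is memory: the whole listing is held in $files at once, which I am assuming is acceptable for a set this size but have not measured.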

2 comments:

  1. Side effect :

    #Output status to host OR pipeline
    Write-Output "$(wdt): Processing files.";

    #Output status to host
    Write-Host "$(wdt): Processing files.";

    ;-)

  2. Thanks and good catch. I have fought battles with Write-Host/Write-Output and didn't think to include the distinction. I started focusing more on this intentionally after reading Don Jones' article:

    http://www.windowsitpro.com/blog/powershell-with-a-purpose-blog-36/scripting-languages/what-to-do--not-to-do-in-powershell-part-1-137475
