Greetings,
Over the years, I've collected a metric tonne of data. Naturally, as new computers entered the household, I would migrate archives of files from one computer to the next. Before long, I ended up with an enormous amount of duplication. I'm a *nix fan, so the platform and scripting are all Linux-based, and I'll be describing things in those terms.
I ran a little utility called FSlint Janitor. One of its features is a recursive search of your directory structures for duplicates, using MD5 values as its means of determining duplication. Well... it found 90 GB worth of duplicates (~43,900 files). You'd have to know what I do to understand why there is so much :) Anyhow, the output looks like this:
#2 x 41,747,941 (41,750,528) bytes wasted
/home/APPS/Temp-A/Visual Studio 2008 Dev/msdn/cab35.cab
/home/APPS/VS8/Visual Studio 2008 Dev/msdn/cab35.cab
#5 x 10,363,392 (41,467,904) bytes wasted
/home/APPS/Temp-A/Visual Studio 2008 Dev/TFC/WCU/DExplore/DExplore.exe
/home/APPS/Temp-A/Visual Studio 2008 Dev/WCU/DExplore/DExplore.exe
/home/APPS/VS8/Visual Studio 2008 Dev/TFC/WCU/DExplore/DExplore.exe
/home/APPS/VS8/Visual Studio 2008 Dev/WCU/DExplore/DExplore.exe
/home/APPS/VS8/Visual Studio 2008 Dev/msdn/WCU/DExplore/DExplore.exe
#2 x 40,891,792 (40,894,464) bytes wasted
/home/Download/KindleForPC-installer.exe
/home/FROMGENERIC/Downloads/KindleForPC-installer.exe
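Each group in that listing starts with a header line carrying a count and two byte figures. As a rough sketch only, the header could be picked apart in Tcl along these lines. The proc name parseHeader and the field names count, size and wasted are made up for illustration, and the pattern assumes every header has exactly the shape shown above.

```tcl
#!/usr/bin/tclsh
# Sketch: pull the numbers out of a group header such as
#   "#5 x 10,363,392 (41,467,904) bytes wasted"
# Assumption: every header follows exactly this shape.
# parseHeader and the dict keys are hypothetical names.

proc parseHeader {line} {
    set pattern {^#([0-9]+) x ([0-9,]+) \(([0-9,]+)\) bytes wasted$}
    if {![regexp $pattern $line -> count size wasted]} {
        return {}                         ;# not a header line
    }
    # Strip the thousands separators so the values are plain integers.
    set size   [string map {, ""} $size]
    set wasted [string map {, ""} $wasted]
    return [dict create count $count size $size wasted $wasted]
}

puts [parseHeader "#5 x 10,363,392 (41,467,904) bytes wasted"]
```

Matching the whole count with a pattern, rather than taking a single character, means a group of 10 or more copies would still be read correctly. A path line returns an empty result, so the same call can double as the "is this a header?" test.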
So now I have a list of all the dupes and where they are located. Obviously, it should be fairly easy to script removing the excess and keeping one, or so one would think. I wrote a little programme to read the output file so that, where there are 5 dupes, it marks 4 for deletion and keeps 1. Since I used to know Tcl fairly well, I chose to write a tclsh script:
------------------------
#!/usr/bin/tclsh
#set words [exec /usr/bin/md5sum $line]
proc rFile {_inFile} {
    global cmp iFile
    set iFile $_inFile
    set marker "#"
    set iFile [ open $_inFile r]
    while {[gets $iFile line] >= 0} {
        set cmp [string first $marker $line]
        if {$cmp == 0} {
            puts "MARK: $line"
            set kk "[string range $line 1 1]"
            set xx [expr $kk-1]
        } else {
            set stop [expr $xx-1]
            if {$stop == 0} {
                set words $line
                puts "DELETE: $stop $words"
                set stop [expr $xx-1]
            } else {
                set words $line
                puts "KEEP: $stop $words"
            }
        }
    }
    close $iFile
}
rFile {dupes.txt}
------------------------
Please don't laugh at my shoddy coding, because it works all the way until I get to the bold area. The current output looks something like this:
MARK: #3 x 1,142,274,512 (2,284,560,384) bytes wasted
KEEP: 1 /home/PICTURES/ACamera/Paraparaumu/20131226_154036.mp4
KEEP: 1 /home/PICTURES/GALLERY/Paraparaumu/20131226_154036.mp4
KEEP: 1 /home/VAR/html/Movies/20131226_154036.mp4
MARK: #3 x 1,068,809,035 (2,137,628,672) bytes wasted
KEEP: 1 /home/PICTURES/ACamera/Wellington Zealandia/20131228_125608.mp4
KEEP: 1 /home/PICTURES/GALLERY/WellingtonZealandia/20131228_125608.mp4
KEEP: 1 /home/VAR/html/Movies/20131228_125608.mp4
MARK: #3 x 936,830,839 (1,873,674,240) bytes wasted
KEEP: 1 /home/PICTURES/ACamera/Akatarawa Valley/20131228_165043.mp4
KEEP: 1 /home/PICTURES/GALLERY/AkatarawaValley/20131228_165043.mp4
KEEP: 1 /home/VAR/html/Movies/20131228_165043.mp4
I cannot remember how to structure the decrements so that, for a group of 5, it marks 4 and keeps 1 (and so on). Maybe I've structured it wrong altogether and simply coded myself into a corner. It's just been so freaking long, but I think a coder will grasp what I want to do. Right now, I'm simply printing DELETE or KEEP for debugging. Ultimately that will be replaced with set var [exec rm -f $fName] or something similar.
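One possible way to get "keep 1, mark the rest" without any decrementing is to count paths within each group and keep only the first. The sketch below is untested against a real report. The names classifyDupes and processReport are hypothetical, and it assumes that every group begins with a "#" header line and that the first path listed is the copy worth keeping. It also sidesteps the header count entirely, which matters because string range $line 1 1 reads only a single digit.

```tcl
#!/usr/bin/tclsh
# Sketch: keep the first path in each duplicate group, flag the rest.
# Assumption: every group begins with a "#N x ..." header line, and the
# first path listed under it is the copy worth keeping.

proc classifyDupes {lines} {
    set result {}
    set seen 0   ;# paths seen so far in the current group
    foreach line $lines {
        if {[string trim $line] eq ""} {
            continue                      ;# ignore blank lines
        }
        if {[string index $line 0] eq "#"} {
            set seen 0                    ;# header: a new group starts
            continue
        }
        if {$seen == 0} {
            lappend result [list KEEP $line]
        } else {
            lappend result [list DELETE $line]
        }
        incr seen
    }
    return $result
}

# Hypothetical driver: read a report and print (or carry out) the plan.
proc processReport {reportPath {dryRun 1}} {
    set fh [open $reportPath r]
    set lines [split [read -nonewline $fh] "\n"]
    close $fh
    foreach entry [classifyDupes $lines] {
        lassign $entry action path
        puts "$action: $path"
        if {$action eq "DELETE" && !$dryRun} {
            file delete -- $path          ;# built-in; no need to exec rm
        }
    }
}
```

Calling processReport dupes.txt would only print the KEEP/DELETE plan. Passing 0 as the second argument would actually remove the flagged files, so it seems wise to eyeball the dry-run output first. Tcl's built-in file delete is used in place of exec rm, which avoids quoting trouble with paths that contain spaces.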
Any thoughts or advice will be greatly welcomed and appreciated.
Cheers