Geekzone: technology news, blogs, forums


JimsonWeed

126 posts

Master Geek
+1 received by user: 4
Inactive user


#207511 30-Dec-2016 11:55

Greetings;


Over the years, I've collected a metric tonne of data.  Naturally, as new computers entered the household, I would migrate archives of files from one computer to the next.  Before long, I ended up with an enormous amount of duplication.  Since I'm a *nix fan, the platform and scripting are all Linux-based, so as I describe this I'll be speaking in those terms.


I ran a little utility called FSlint Janitor.  One of its features is that it will recursively search your directory structures for duplicates, using MD5 checksums as its means of determining duplication.  Well... it found 90GB worth of duplicates (~43,900 files).  You'd have to know what I do to understand why there is so much :)  Anyhow, the output looks like this:


#2 x 41,747,941 (41,750,528)    bytes wasted
/home/APPS/Temp-A/Visual Studio 2008 Dev/msdn/cab35.cab
/home/APPS/VS8/Visual Studio 2008 Dev/msdn/cab35.cab
#5 x 10,363,392 (41,467,904)    bytes wasted
/home/APPS/Temp-A/Visual Studio 2008 Dev/TFC/WCU/DExplore/DExplore.exe
/home/APPS/Temp-A/Visual Studio 2008 Dev/WCU/DExplore/DExplore.exe
/home/APPS/VS8/Visual Studio 2008 Dev/TFC/WCU/DExplore/DExplore.exe
/home/APPS/VS8/Visual Studio 2008 Dev/WCU/DExplore/DExplore.exe
/home/APPS/VS8/Visual Studio 2008 Dev/msdn/WCU/DExplore/DExplore.exe
#2 x 40,891,792 (40,894,464)    bytes wasted
/home/Download/KindleForPC-installer.exe
/home/FROMGENERIC/Downloads/KindleForPC-installer.exe
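(For the curious, a similar duplicate report can be approximated with a plain md5sum pipeline. This is just a rough sketch of the same idea, not FSlint's actual implementation; the throwaway directory is only for demonstration:)

```shell
# Rough sketch of MD5-based duplicate detection, similar in spirit to
# FSlint's duplicate finder. The temp directory and files are examples;
# point the find at your own tree instead.
tmp=$(mktemp -d)
printf 'same'  > "$tmp/a"
printf 'same'  > "$tmp/b"
printf 'other' > "$tmp/c"

# Hash everything, sort by hash, and keep only lines whose first 32
# characters (the MD5) repeat -- i.e. the duplicate groups.
find "$tmp" -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate

rm -r "$tmp"
```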


So now I have a list of all the dupes and where they are located.  Obviously, it's fairly easy to script this to remove the excess and keep one, or so one would think.  I wrote a little programme to read the output file and, where there are (say) 5 dupes, mark 4 and keep 1.  Since I used to know TCL fairly well, I chose to write a tclsh script:


------------------------


#!/usr/bin/tclsh

#set words [exec /usr/bin/md5sum $line]

proc rFile {_inFile} {
  global cmp iFile
  set iFile $_inFile
  set marker "#"
  set iFile [ open $_inFile r]
  while {[gets $iFile line] >= 0} {
    set cmp [string first $marker $line]
    if {$cmp == 0} {
       puts "MARK: $line"
       set kk "[string range $line 1 1]"
       set xx [expr $kk-1]
    } else {
       # this next bit is the part that doesn't work
       set stop [expr $xx-1]
         if {$stop == 0} {
           set words $line
           puts "DELETE: $stop $words"
           set stop [expr $xx-1]
         } else {
           set words $line
           puts "KEEP: $stop $words"
         }
    }
  }
  close $iFile
}
rFile {dupes.txt}


------------------------


Please don't laugh at my shoddy coding, because it works all the way until it reaches the KEEP/DELETE logic in the else branch.  The current output looks something like this:


MARK: #3 x 1,142,274,512    (2,284,560,384)    bytes wasted
KEEP: 1 /home/PICTURES/ACamera/Paraparaumu/20131226_154036.mp4
KEEP: 1 /home/PICTURES/GALLERY/Paraparaumu/20131226_154036.mp4
KEEP: 1 /home/VAR/html/Movies/20131226_154036.mp4
MARK: #3 x 1,068,809,035    (2,137,628,672)    bytes wasted
KEEP: 1 /home/PICTURES/ACamera/Wellington Zealandia/20131228_125608.mp4
KEEP: 1 /home/PICTURES/GALLERY/WellingtonZealandia/20131228_125608.mp4
KEEP: 1 /home/VAR/html/Movies/20131228_125608.mp4
MARK: #3 x 936,830,839    (1,873,674,240)    bytes wasted
KEEP: 1 /home/PICTURES/ACamera/Akatarawa Valley/20131228_165043.mp4
KEEP: 1 /home/PICTURES/GALLERY/AkatarawaValley/20131228_165043.mp4
KEEP: 1 /home/VAR/html/Movies/20131228_165043.mp4


I cannot remember how to structure the decrements so that it marks 4 and keeps 1 (and so on).  Maybe I've structured it wrong altogether and simply coded myself into a corner.  It's just been so long, but I think a coder will see what it is I want to do.  Right now I'm simply printing DELETE or KEEP as an output statement for debugging.  Ultimately that will be replaced with something like set var [exec rm -f $fName].
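In case it helps show the shape of it, the keep-one-drop-the-rest pass I'm after can be sketched in awk against the FSlint output format above (dupes.txt is the saved report; KEEP/DELETE are just debug labels, as in my script):

```shell
# Sketch: for each "#N x ..." marker group in dupes.txt, keep the
# first path and mark every later path in the group for deletion.
awk '
  /^#/ { first = 1; next }       # marker line starts a new group
  NF   { if (first) { print "KEEP:   " $0; first = 0 }
         else       { print "DELETE: " $0 } }
' dupes.txt
```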


Any thoughts or advice will be greatly welcomed and appreciated.


Cheers


JimsonWeed



  #1696522 30-Dec-2016 16:29

Well, I guess we can disregard this... I solved it :)  I'll tweak it for its intended purpose but, if you use FSlint to find duplicates on a Linux box and want to delete them quickly, here you go.  Run this on the output file from FSlint and redirect the output to another file:

./dupeKill.tcl > kDupe.sh
chmod +x kDupe.sh
./kDupe.sh

and then off you go.

 

---------------------------------------

 

#!/usr/bin/tclsh

# Read the FSlint duplicate report (dupes.txt) and emit a bash script
# that removes all but one copy of each duplicate group, moving the
# surviving copy to /var/archive.
proc rFile {_inFile} {
  set marker "#"
  set iFile [open $_inFile r]
  puts "#!/bin/bash"
  set jj 0
  while {[gets $iFile line] >= 0} {
    if {[string first $marker $line] == 0} {
      # Marker line, e.g. "#5 x 10,363,392 ...": pull out the dupe
      # count. (regexp handles counts of 10 or more, which the old
      # [string range $line 1 1] did not.)
      regexp {^#(\d+)} $line -> kk
      set jj [expr {$kk - 1}]
    } elseif {[string trim $line] ne ""} {
      # Path line: delete the first N-1 copies, archive the last one.
      if {$jj > 0} {
        puts "rm -rf \"$line\""
      } else {
        puts "mv \"$line\" /var/archive"
      }
      set jj [expr {$jj - 1}]
    }
  }
  close $iFile
}
rFile {dupes.txt}

 

---------------------------------------
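One extra safety note: before running the generated kDupe.sh for real, it's easy to dry-run it by turning each rm/mv into an echo so the commands are printed rather than executed (a sketch; the file contents here are just an example):

```shell
# Build a tiny example kDupe.sh, then dry-run it: syntax-check first,
# then prefix each rm/mv with echo so commands are printed, not run.
printf '%s\n' '#!/bin/bash' 'rm -f "/tmp/example-dupe"' > kDupe.sh

bash -n kDupe.sh                                           # syntax check only
sed 's/^rm /echo rm /; s/^mv /echo mv /' kDupe.sh | bash   # dry run

rm kDupe.sh
```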









