Geekzone: technology news, blogs, forums




Aaroona (2425 posts, Uber Geek, +1 received by user: 24)
Topic # 142589 18-Mar-2014 09:51

We're running a project at work; part of it requires us to copy over 12 million files, and at least as many folders, to another machine.

We want to verify that all the files are there, i.e. compare the prod system to the new system we are implementing.

I've tried using an MD5 checker, which eventually bombed out because there are just so many files and folders.


I'm trying a method I wrote in PowerShell to check that all the files exist on both sides (note: not an MD5 check or anything; we just want to find out whether/where we have gaps), but I've had it running for over 24 hours and it has only scanned 3.5 days of each year, across 8 years' worth of data, so it's going to take ages.
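For what it's worth, the script is roughly along these lines (a simplified sketch; the paths here are illustrative, not our real ones):

$root = 'D:\Recordings'        # illustrative root folder, same layout on both machines
Get-ChildItem $root -Recurse |
    ForEach-Object { $_.FullName.Substring($root.Length) } |
    Sort-Object |
    Out-File C:\temp\listing.txt -Encoding utf8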



MD5 would be nice, but it's not the highest priority at this stage; I'd like to add it later.


Any ideas on what I may be able to do? 







ubergeeknz (Voice Engineer @ Orcon, 1912 posts, Uber Geek, +1 received by user: 432, Trusted, Orcon, Subscriber)
Reply # 1007902 18-Mar-2014 09:55

Can you do a dump of filename, size, and MD5 on each machine, then compare them?  May be quicker this way than doing the whole thing from one end.
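Something like this (a rough sketch only; Get-FileHash needs PowerShell 4.0 or later, and the paths are purely illustrative) would give you a dump you can diff between the two machines:

$root = 'D:\Recordings'        # illustrative root folder
Get-ChildItem $root -Recurse -File | ForEach-Object {
    $md5 = (Get-FileHash $_.FullName -Algorithm MD5).Hash
    '{0}|{1}|{2}' -f $_.FullName.Substring($root.Length), $_.Length, $md5
} | Out-File C:\temp\hashes.txt -Encoding utf8

Run it on both machines, then Compare-Object (Get-Content prod-hashes.txt) (Get-Content new-hashes.txt) will show only the entries that differ.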

Zeon (3015 posts, Uber Geek, +1 received by user: 196, Trusted, Subscriber)
Reply # 1007927 18-Mar-2014 10:33 (one person supports this post)

Are you sure a filesystem is the best way to store that many files? Object-orientated storage would be better?







Aaroona (2425 posts, Uber Geek, +1 received by user: 24)
Reply # 1007930 18-Mar-2014 10:37

Hi guys,

ubergeeknz: Can you do a dump of filename, size, and MD5 on each machine, then compare them?  May be quicker this way than doing the whole thing from one end.


The PowerShell script I wrote just pulls the relative file/folder path in plain text and puts it in a text file. Unfortunately the paths can be quite long, even relative, so this isn't working quite as well as I had hoped.
Adding file size and MD5 would probably kill it faster :) 

Zeon: Are you sure a filesystem is the best way to store that many files? Object-orientated storage would be better?


You're dead right, but it's not my choice. It's a technical limitation of the software the files are for. The developers decided that was the way to do it when they developed it. So this is what we are stuck with.







ubergeeknz (Voice Engineer @ Orcon, 1912 posts, Uber Geek, +1 received by user: 432, Trusted, Orcon, Subscriber)
Reply # 1007936 18-Mar-2014 10:42

Surely regardless of the number of files it is just a matter of time to process them all?  What kind of limitation are you running into?



(136 posts, Master Geek, +1 received by user: 26)
Reply # 1007938 18-Mar-2014 10:43

I use Beyond Compare quite often to compare specific files, but it has a file/folder compare too.
Not sure how it would handle that many files (or what the limitations of the trial are), but it might be worth a shot?

http://www.scootersoftware.com/moreinfo.php



Aaroona (2425 posts, Uber Geek, +1 received by user: 24)
Reply # 1007942 18-Mar-2014 10:44

ubergeeknz: Surely regardless of the number of files it is just a matter of time to process them all?  What kind of limitation are you running into?




EDIT: Sorry, just saw you might be asking about the actual software it's for. It's a call recording suite called Verint. Unfortunately we are on an old version, so they may have changed how they store files in later versions, but for our version it's just files in a folder structure, plus (if I remember correctly) a database the web end uses for lookup, which holds the location of each file.



Using an MD5 tool, the limitation I ran into was that the tool crashed, which I presume only happened because there is such a large structure of files and folders.

Using the PowerShell method, it's made a text file over 500MB now, and still counting; not the easiest to process.
I may just refine my PowerShell script to split the output into 100MB files, if there's no other way of doing it.
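Something along these lines would do the split (by line count rather than exact megabytes; file names are just illustrative):

$chunk = 0
Get-Content C:\temp\listing.txt -ReadCount 100000 | ForEach-Object {
    # each batch of up to 100,000 lines goes to its own numbered file
    $_ | Out-File ('C:\temp\listing_{0:D4}.txt' -f $chunk) -Encoding utf8
    $chunk++
}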







gzt (4587 posts, Uber Geek, +1 received by user: 244, Subscriber)
Reply # 1008013 18-Mar-2014 12:08

Have a look at your disk and CPU while the process is running; it is likely both are under-utilised. Simple solution: start multiple instances of your MD5 process, aiming them at different directories. Repeat until you hit one of those utilisation limits (include the network as well if it's a different box).
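For illustration only (this assumes PowerShell 4.0+ for Get-FileHash, and one top-level folder per year, which may not match the real layout):

$root = 'D:\Recordings'        # illustrative root folder
$jobs = Get-ChildItem $root -Directory | ForEach-Object {
    # one background job per top-level folder, so disk and CPU get used in parallel
    Start-Job -ArgumentList $_.FullName -ScriptBlock {
        param($dir)
        Get-ChildItem $dir -Recurse -File | ForEach-Object {
            '{0}|{1}' -f $_.FullName, (Get-FileHash $_.FullName -Algorithm MD5).Hash
        }
    }
}
$jobs | Wait-Job | Receive-Job | Out-File C:\temp\hashes.txt -Encoding utf8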

gzt (4587 posts, Uber Geek, +1 received by user: 244, Subscriber)
Reply # 1008016 18-Mar-2014 12:13

There are also free and open source tools that will assist in your aim.

charsleysa (516 posts, Ultimate Geek, +1 received by user: 103)
Reply # 1008100 18-Mar-2014 14:06

Sounds like you should get a programmer to build you a multithreaded utility in C# that can utilize all CPUs and hardware threads.

It's very easy to do.




Regards
Stefan Andres Charsley

(39 posts, Geek, +1 received by user: 3)
Reply # 1008105 18-Mar-2014 14:17

charsleysa: Sounds like you should get a programmer to build you a multithreaded utility in C# that can utilize all CPUs and hardware threads.

It's very easy to do.


Yeah, very easy. You should add a column to the DB with a hash as well, so you can verify the files in the future; there's nothing worse than not being able to tell that one file is missing or corrupt.

Also, you could use rsync to copy the files instead.
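For example, a checksum-based dry run along these lines will list anything missing or different on the destination without copying it (paths and hostname are illustrative, and it needs rsync at both ends, e.g. via Cygwin on Windows):

rsync -rcn --itemize-changes /cygdrive/d/recordings/ user@newbox:/data/recordings/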

(738 posts, Ultimate Geek, +1 received by user: 21)
Reply # 1008112 18-Mar-2014 14:24

Aaroona: We're running a project at work; part of it requires us to copy over 12 million files, and at least as many folders, to another machine.

We want to verify that all the files are there, i.e. compare the prod system to the new system we are implementing.


So if I understand, you want to copy files from Machine A to Machine B, verifying them at the same time?  And only copy where the files don't already exist?

You can just use XCOPY from the command line in Windows, and use /D to only copy new files, and /V to verify the files against the source.  To get from one machine to another just map a network drive, using NET USE.  No need to write a program.  You can even redirect the results into a file:

XCOPY [source] [target] /D /S /Y /V > logfile.txt


ubergeeknz (Voice Engineer @ Orcon, 1912 posts, Uber Geek, +1 received by user: 432, Trusted, Orcon, Subscriber)
Reply # 1008114 18-Mar-2014 14:28

Aaroona: Using the PowerShell method, it's made a text file over 500MB now, and still counting; not the easiest to process.


Provided you expect perfectly consistent output from both A and B servers, just hash the resulting file on each server :)
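e.g., assuming PowerShell 4.0+ on both servers (on boxes without it, certutil -hashfile can do the same job):

Get-FileHash C:\temp\listing.txt -Algorithm MD5

If the two hashes match, the two listings are identical, i.e. the same set of paths exists on both sides.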

(115 posts, Master Geek, +1 received by user: 37)
Reply # 1008135 18-Mar-2014 14:59

I have 500K files that I need to verify periodically. I use FreeFileSync. It works well, takes about half an hour, and has a bunch of options. It's also cross-platform.


JWR (105 posts, Master Geek, +1 received by user: 23)
Reply # 1008152 18-Mar-2014 15:26

I use this for copying and syncing large numbers of files between locations....

http://sourceforge.net/projects/freefilesync/

(424 posts, Ultimate Geek, +1 received by user: 66, Trusted, Subscriber)
Reply # 1008194 18-Mar-2014 16:16

http://arcainsula.co.nz/2013/copying-large-amounts-of-data-with-linux/

I wrote this blog post to address exactly that. Although it refers to use under Ubuntu, you can use the same instructions with Cygwin. The files are checksummed on the fly.

It is a stable, long-established and supported way of doing that one task: copying large amounts of data reliably.
