Geekzone: technology news, blogs, forums
Aaroona (2931 posts, Uber Geek, Subscriber) · #142589 · 18-Mar-2014 09:51

We're running a project at work; part of it requires us to copy over 12 million files, and at least as many folders, to another machine.

We want to verify that all the files are there, i.e. compare the prod system to the new system we are implementing.

I've tried using an MD5 checker, which eventually bombed out because there are just so many files and folders.


I'm now trying a method I wrote in PowerShell to check that all the files exist (note: no MD5 check or anything; we just want to find out where, or if, we have gaps), but I've had it running for over 24 hours and it has only scanned 3.5 days of each year across 8 years' worth of data, so it's going to take AGGGGES.
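(For illustration, a minimal sketch of that kind of existence-only check in PowerShell; the root and output paths here are hypothetical, and -File needs PowerShell 3+. Run it on each machine, then diff the two text files.)

$root = 'D:\Recordings'    # hypothetical root; adjust per machine
# Emit each file's path relative to $root, one per line, into a text file.
Get-ChildItem -Path $root -Recurse -File |
    ForEach-Object { $_.FullName.Substring($root.Length) } |
    Set-Content -Path 'C:\temp\filelist.txt'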



MD5 would be nice, but it's not the highest priority at this stage; we'd like it to come later.


Any ideas on what I may be able to do? 





ubergeeknz (3344 posts, Uber Geek, Trusted, Vocus) · #1007902 · 18-Mar-2014 09:55

Can you do a dump of filename, size, and MD5 on each machine, then compare them? It may be quicker this way than doing the whole thing from one end.
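A minimal sketch of that dump-and-compare idea in PowerShell (root and file names are hypothetical; Get-FileHash needs PowerShell 4+, so older hosts would have to call .NET's MD5 class directly):

$root = 'D:\Recordings'    # hypothetical root
# Build a manifest of relative path, size and MD5, one file per line.
Get-ChildItem -Path $root -Recurse -File | ForEach-Object {
    $md5 = (Get-FileHash -Path $_.FullName -Algorithm MD5).Hash
    '{0}|{1}|{2}' -f $_.FullName.Substring($root.Length), $_.Length, $md5
} | Set-Content -Path 'C:\temp\manifest.txt'

# Then, with both manifests copied to one box:
Compare-Object (Get-Content .\prod-manifest.txt) (Get-Content .\new-manifest.txt)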

Zeon (3496 posts, Uber Geek, Trusted) · #1007927 · 18-Mar-2014 10:33

Are you sure a filesystem is the best way to store that many files? Wouldn't object storage be better?





Aaroona (2931 posts, Uber Geek, Subscriber) · #1007930 · 18-Mar-2014 10:37

Hi guys,

ubergeeknz: Can you do a dump of filename, size, and MD5 on each machine, then compare them? It may be quicker this way than doing the whole thing from one end.


The PowerShell script I wrote just pulls the relative file/folder path in plain text and puts it in a text file. Unfortunately the paths can be quite long, even relative, so this isn't working quite as well as I had hoped.
Adding file size and MD5 would probably kill it faster :)

Zeon: Are you sure a filesystem is the best way to store that many files? Wouldn't object storage be better?


You're dead right, but it's not my choice; it's a limitation of the software the files are for. The developers decided to store them that way when they built it, so this is what we're stuck with.





ubergeeknz (3344 posts, Uber Geek, Trusted, Vocus) · #1007936 · 18-Mar-2014 10:42

Surely, regardless of the number of files, it is just a matter of time to process them all? What kind of limitation are you running into?



Ultimate Geek (479 posts, Subscriber) · #1007938 · 18-Mar-2014 10:43

I use Beyond Compare quite often to compare specific files, but it has a file/folder compare too.
Not sure how it would handle that many files (or what the limitations of the trial are), but it might be worth a shot?

http://www.scootersoftware.com/moreinfo.php







Aaroona (2931 posts, Uber Geek, Subscriber) · #1007942 · 18-Mar-2014 10:44

ubergeeknz: Surely, regardless of the number of files, it is just a matter of time to process them all? What kind of limitation are you running into?




EDIT: Sorry, just saw you might be speaking about the actual software it's for. It's a call recording suite called Verint. Unfortunately we're on an old version, so they may have changed how they store files in later versions, but for our version it's just files in a folder structure, plus (if I remember correctly) a database the web front-end uses to look up where each file is located.



Using an MD5 tool, the limitation I ran into was that the tool crashed, which I presume only happened because the structure of files and folders is so large.

Using the PowerShell method, it's made a text file over 500 MB now, and still counting; not the easiest to process.
I may just refine my PowerShell script to split the output into 100 MB lots, if there's no other way of doing it.
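One alternative to splitting: hold one listing in a .NET HashSet and stream the other past it, so you never have to cross-compare two huge text files line by line. A minimal sketch (file names hypothetical; the ::new() syntax needs PowerShell 5+):

# Load the prod listing into a set, then knock out every path seen on the new system.
$prod = [System.Collections.Generic.HashSet[string]]::new(
    [string[]](Get-Content 'C:\temp\prod-list.txt'))
Get-Content 'C:\temp\new-list.txt' | ForEach-Object { [void]$prod.Remove($_) }
# Whatever is left exists on prod but is missing from the new system.
$prod | Set-Content -Path 'C:\temp\missing-on-new.txt'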





gzt (10908 posts, Uber Geek) · #1008013 · 18-Mar-2014 12:08

Have a look at your disk and CPU while the process is running; it is likely both are under-utilised. Simple solution: start multiple instances of your MD5 process, aiming them at different directories. Repeat until you hit one of those utilisation limits. The network might be the limit too, if it's a different box.
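A minimal sketch of that idea with PowerShell background jobs, one job per top-level directory (root and output paths are hypothetical; Get-FileHash assumes PowerShell 4+):

$root = 'D:\Recordings'    # hypothetical root
$jobs = Get-ChildItem -Path $root -Directory | ForEach-Object {
    Start-Job -ArgumentList $_.FullName -ScriptBlock {
        param($dir)
        # Each job writes its own manifest, so the jobs never contend on one output file.
        Get-ChildItem -Path $dir -Recurse -File | ForEach-Object {
            '{0}|{1}' -f $_.FullName, (Get-FileHash -Path $_.FullName -Algorithm MD5).Hash
        } | Set-Content -Path ('C:\temp\' + (Split-Path $dir -Leaf) + '.manifest.txt')
    }
}
$jobs | Wait-Job | Out-Null

Throughput usually tops out at the disk or the network, so adding jobs past that point mostly just adds seek contention.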

gzt (10908 posts, Uber Geek) · #1008016 · 18-Mar-2014 12:13

There are also free and open source tools that will assist with your aim.

charsleysa (597 posts, Ultimate Geek) · #1008100 · 18-Mar-2014 14:06

Sounds like you should get a programmer to build you a multithreaded utility in C# that can utilize all CPUs and hardware threads.

It's very easy to do.




Regards
Stefan Andres Charsley

Geek (40 posts) · #1008105 · 18-Mar-2014 14:17

charsleysa: Sounds like you should get a programmer to build you a multithreaded utility in C# that can utilize all CPUs and hardware threads.

It's very easy to do.


Yeah, very easy. You should also add a column to the DB with a hash, so you can verify the files in the future; there's nothing worse than not being able to tell that one file is missing or corrupt.

Also, you could use rsync to copy the files instead.
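For example (a sketch with hypothetical paths; -c makes rsync compare files by checksum rather than by size and timestamp, which is slower but catches silent differences, and -i logs what was transferred):

rsync -avci /data/recordings/ newserver:/data/recordings/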

Uber Geek (1566 posts) · #1008112 · 18-Mar-2014 14:24

Aaroona: We're running a project at work; part of it requires us to copy over 12 million files, and at least as many folders, to another machine.

We want to verify that all the files are there, i.e. compare the prod system to the new system we are implementing.


So if I understand, you want to copy files from Machine A to Machine B, verifying them at the same time?  And only copy where the files don't already exist?

You can just use XCOPY from the command line in Windows, with /D to only copy newer files and /V to verify the files against the source. To get from one machine to another, just map a network drive using NET USE. No need to write a program. You can even redirect the results into a file:

XCOPY [source] [target] /D /S /Y /V > logfile.txt
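(/S recurses into subdirectories and /Y suppresses overwrite prompts. One caveat: on NT-based Windows, /V is widely reported to perform only a cursory check rather than a full byte-for-byte compare, so a separate hash comparison is still worthwhile if you need real assurance.)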


ubergeeknz (3344 posts, Uber Geek, Trusted, Vocus) · #1008114 · 18-Mar-2014 14:28

Aaroona: Using the PowerShell method, it's made a text file over 500 MB now, and still counting; not the easiest to process.


Provided you expect perfectly consistent output from both A and B servers, just hash the resulting file on each server :)
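For instance (a sketch; assumes both listings were produced the same way, and PowerShell 4+ for Get-FileHash):

# Sort first so directory enumeration order can't cause false mismatches,
# then compare one short MD5 string per server instead of 500 MB of text.
Get-Content 'C:\temp\filelist.txt' | Sort-Object |
    Set-Content -Path 'C:\temp\filelist.sorted.txt'
(Get-FileHash -Path 'C:\temp\filelist.sorted.txt' -Algorithm MD5).Hash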

Uber Geek (2987 posts, Lifetime subscriber) · #1008135 · 18-Mar-2014 14:59

I have 500K files that I need to verify periodically. I use FreeFileSync. It works well, takes about half an hour, and has a bunch of options. It's also cross-platform.


JWR (779 posts, Ultimate Geek) · #1008152 · 18-Mar-2014 15:26

I use this for copying and syncing large numbers of files between locations....

http://sourceforge.net/projects/freefilesync/

Ultimate Geek (488 posts, Trusted) · #1008194 · 18-Mar-2014 16:16

http://arcainsula.co.nz/2013/copying-large-amounts-of-data-with-linux/

I wrote this blog post to address exactly that. Although it refers to Ubuntu, you can use the same instructions under Cygwin. The files are checksummed on the fly.

It is a stable, long-established and supported means of doing that one task: copying large amounts of data, reliably.



