Rogue Crawler

Posted: 13-Oct-2008 15:29

One of the sites I look after had a massive bandwidth spike over the last two days, so I figured I had better take a look. What I found was tens of thousands of requests from a crawler which had got itself into a recursive loop:

- - [12/Oct/2008:11:50:20 -0700] "GET /business-directory/2/40/application/x-shockwave-flash/text/javascript/application/x-shockwave-flash/text/javascript/text/javascript/application/x-shockwave-flash/text/javascript HTTP/1.0" 200 21360 "http://www.***.com/business-directory/2/40/application/x-shockwave-flash/text/javascript/application/x-shockwave-flash/text/javascript/text/javascript/application/x-shockwave-flash/text/javascript/" "Mozilla/5.0 (compatible; itsapic.com_crawler/0.01 +;"

The crawler obviously does some sort of sloppy regex search for things that "might be" URLs. In this case it picked up type="application/x-shockwave-flash" and type="text/javascript" and figured each one might just be a relative URL.
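
To illustrate how a loop like this can arise, here is a minimal Python sketch. The crawler's real logic is unknown; the `CANDIDATE` pattern and the `naive_links` and `simulate` helpers are hypothetical, `www.example.com` stands in for the masked hostname, and the sketch assumes the crawler treats any quoted string containing a slash as a relative link on a page whose URL ends in a slash.

```python
import re
from urllib.parse import urljoin

# Hypothetical over-eager pattern: any quoted token with a slash in it.
CANDIDATE = re.compile(r'"([\w.+-]+/[\w.+-]+)"')

# Markup of the kind found on the page: MIME types, not links.
HTML = (
    '<embed type="application/x-shockwave-flash">'
    '<script type="text/javascript"></script>'
)

def naive_links(page_url, html):
    """Resolve every slash-containing quoted string as a relative URL."""
    return [urljoin(page_url, token) for token in CANDIDATE.findall(html)]

def simulate(start_url, html, steps):
    """Follow the first bogus link repeatedly, returning each URL visited."""
    visited = []
    url = start_url
    for _ in range(steps):
        # Assumption: the site serves these pages with a trailing slash,
        # so each bogus "link" is appended rather than replacing a segment.
        url = naive_links(url, html)[0].rstrip("/") + "/"
        visited.append(url)
    return visited

if __name__ == "__main__":
    for url in simulate("http://www.example.com/business-directory/2/40/", HTML, 3):
        print(url)
```

Because every page served contains the same two type attributes, each fetch yields new, longer "links", and the path grows without bound.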

Unfortunately, rewrite rules on the site meant that the resulting path was a valid URL, so the server returned the appropriate HTML.
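
As a rough illustration of why a bogus path can still resolve, compare a permissive pattern with an anchored one. Both patterns and the `route` helper are hypothetical; the site's real rewrite rules are not shown in this post.

```python
import re

# Hypothetical permissive rule: anything after the two numeric IDs is ignored.
LOOSE = re.compile(r"^/business-directory/(\d+)/(\d+)")

# Stricter alternative: the path must end right after the IDs.
STRICT = re.compile(r"^/business-directory/(\d+)/(\d+)/?$")

def route(pattern, path):
    """Return an HTTP status: 200 if the rule matches the path, else 404."""
    return 200 if pattern.match(path) else 404

if __name__ == "__main__":
    bogus = "/business-directory/2/40/application/x-shockwave-flash/text/javascript"
    print("loose: ", route(LOOSE, bogus))
    print("strict:", route(STRICT, bogus))
    print("strict, real page:", route(STRICT, "/business-directory/2/40/"))
```

With an anchored rule the first bogus request gets a 404, so the crawler has no page from which to harvest further "links".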

So there was a nice little loop going on, consuming nearly 2 GB over the last two days.
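
One way to spot this kind of consumer is to total response bytes per user agent from the access log. The following is a sketch only, assuming the Apache "combined" log format; the `bytes_by_agent` helper is hypothetical, and the sample lines, IP addresses and agent names are made up for illustration.

```python
import re
from collections import Counter

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "agent"
LINE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) (\d+|-) "[^"]*" "([^"]*)"'
)

def bytes_by_agent(lines):
    """Sum response bytes per user agent, skipping lines that do not parse."""
    totals = Counter()
    for line in lines:
        m = LINE.match(line)
        if m:
            size = m.group(3)
            # A "-" in the bytes field means no body was sent.
            totals[m.group(4)] += 0 if size == "-" else int(size)
    return totals

# Made-up sample data, not taken from the real log.
SAMPLE = [
    '192.0.2.1 - - [12/Oct/2008:11:50:20 -0700] "GET /a HTTP/1.0" 200 21360 "-" "ExampleBot/0.01"',
    '192.0.2.1 - - [12/Oct/2008:11:50:21 -0700] "GET /a/b HTTP/1.0" 200 21360 "-" "ExampleBot/0.01"',
    '192.0.2.7 - - [12/Oct/2008:11:50:22 -0700] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for agent, total in bytes_by_agent(SAMPLE).most_common():
        print(agent, total)
```

In practice the lines would come from the log file itself, for example by passing an open file object to `bytes_by_agent`.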

According to the crawler's info page, the crawler is "designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity."

Somehow, I don't call hammering a site day and night a "measured, adaptive pace".

Welcome to the IP ban-list.


sleemanj's profile

James Sleeman
New Zealand

PHP Programmer Extraordinaire

All views expressed are held by the poster, not necessarily any person or organisation associated therewith.