Rogue Crawler

Posted: 13-Oct-2008 15:29

One of the sites I look after had a massive bandwidth spike over the last two days, so I figured I had better take a look. What I found was tens of thousands of requests from a crawler that had got itself into a recursive loop:

- - [12/Oct/2008:11:50:20 -0700] "GET /business-directory/2/40/application/x-shockwave-flash/text/javascript/application/x-shockwave-flash/text/javascript/text/javascript/application/x-shockwave-flash/text/javascript HTTP/1.0" 200 21360 "http://www.***.com/business-directory/2/40/application/x-shockwave-flash/text/javascript/application/x-shockwave-flash/text/javascript/text/javascript/application/x-shockwave-flash/text/javascript/" "Mozilla/5.0 (compatible; itsapic.com_crawler/0.01 +;"

The crawler obviously does some sort of bad regex lookup for things that "might be" URLs. In this case it picked up type="application/x-shockwave-flash" and type="text/javascript" and figured each one might just be a relative URL.
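
I don't have the crawler's source, so the following is only a guess at the failure mode, sketched in Python. The pattern NAIVE_LINK, the helper names and the www.example.com host are all made up for illustration. It assumes pages are served with a trailing slash (as the referrer in the log line suggests), so each bogus "link" resolves one level deeper than the page it was found on.

```python
import re
from urllib.parse import urljoin

# Hypothetical over-eager pattern: any quoted token of the form "word/word"
# is treated as a relative link. MIME types match it just as well as paths do.
NAIVE_LINK = re.compile(r'"([\w.+-]+/[\w.+-]+)"')

# Markup of the kind that appears on every page of the site.
PAGE_HTML = (
    '<object type="application/x-shockwave-flash"></object>'
    '<script type="text/javascript"></script>'
)

def naive_links(base_url, html):
    """Resolve every link-like quoted token against the page URL."""
    return [urljoin(base_url, token) for token in NAIVE_LINK.findall(html)]

def simulate(start_url, html, hops):
    """Follow the first bogus link repeatedly, as a looping crawler might."""
    url, visited = start_url, []
    for _ in range(hops):
        # Assumption: the site serves (or links to) the trailing-slash form.
        url = naive_links(url, html)[0] + "/"
        visited.append(url)
    return visited

if __name__ == "__main__":
    for u in simulate("http://www.example.com/business-directory/2/40/", PAGE_HTML, 3):
        print(u)
```

Because every page returns the same markup, each hop yields a longer URL that the crawler has never seen before, so a simple "already visited" check never stops it.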

Unfortunately, the rewrite rules on the site meant that the resulting path was a valid URL, so the server returned the appropriate HTML.

So there was a nice little loop going on there, consuming nearly 2 GB over the last two days.
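
For anyone wanting to put a number on this sort of thing, here is a rough Python sketch that totals response bytes per user agent. It assumes the access log is in Apache's combined format, like the line quoted above. The function name, the sample agent string and the 192.0.2.1 address are placeholders, not values from my logs.

```python
import re
from collections import defaultdict

# Combined log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?P<bytes>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def bytes_by_agent(lines):
    """Sum the response-size field for each user-agent string."""
    totals = defaultdict(int)
    for line in lines:
        m = LOG_LINE.match(line)
        # A size of "-" means no body was sent; lines that don't parse are skipped.
        if m and m.group("bytes") != "-":
            totals[m.group("agent")] += int(m.group("bytes"))
    return dict(totals)

if __name__ == "__main__":
    sample = [
        '192.0.2.1 - - [12/Oct/2008:11:50:20 -0700] "GET /a HTTP/1.0" 200 21360 "-" "example_crawler/0.01"',
        '192.0.2.1 - - [12/Oct/2008:11:50:21 -0700] "GET /b HTTP/1.0" 200 21360 "-" "example_crawler/0.01"',
    ]
    # Print agents from heaviest to lightest.
    for agent, total in sorted(bytes_by_agent(sample).items(), key=lambda kv: -kv[1]):
        print(total, agent)
```

Feeding it the real file is just a matter of passing an open file object instead of the sample list.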

According to the crawler's info page, the crawler "is designed to respect the robots.txt exclusion directives and META robots tags, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity."

Somehow I don't call hammering a site day and night a "measured, adaptive pace".

Welcome to the IP ban-list.

