On badly behaved crawlers

For the last few weeks my main system has had to be rebooted daily, and sometimes twice a day. Having other things on my mind, I didn't get too fussed, thinking perhaps I had some dodgy RAM or a disk on the way out, both things I could live with until they finally became terminal. As I looked at the pattern of failure, though, it started to look like a deliberate denial of service attack on the site. So I began active monitoring as well as log analysis.

Now I have no problem with search engines crawling my site. Mostly (apart from some well known exceptions) they behave themselves, abide by the robots.txt rules, and don't flood the system with requests. Above all, they perform a useful function to both me and my readers. I do have a problem with a crawler that I have identified as the cause of the system failures.

Brandwatch is apparently involved in "Online Reputation Management and Brand Tracking in Social Media". They have a crawler that will just swamp a reasonably small site. No pacing, no maximum number of requests per second, just blast away at the fastest possible rate they can, and if you happen to be the target then you'd better be ready for a connection storm. If you are having problems and see the string 'magpie-crawler' in your logs, then you may well be one of their victims. Worse, they perform no useful function for me or my readers. They sell information to their clients on who is talking about them. That's right, they are trawling my site in order to sell information to someone who is worried I might be saying something negative. Well, I am. Brandwatch is a menace. And they need to get their crawler fixed. Until they do, I have blocked 80.82.139.128/27 from my servers and I'd urge others to do the same until they wake up. What is particularly galling is that on their blog they have a piece on how bad Google Search is. Sheesh, at least Google's crawlers are well behaved and don't interfere with the running of the site.

Not happy, Jan!

* P.S. - No I haven't put in URLs. I don't want to advertise them any more than they deserve.
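
For anyone wanting to check their own logs for the 'magpie-crawler' string mentioned above, here is a rough sketch in Python. It is not the tooling I used at the time. It assumes the access log is in the common Apache/nginx "combined" format, and the function name `summarise` and the regular expression are my own illustration. It counts hits per second for the matching user agent and reports which client addresses fall inside or outside the 80.82.139.128/27 block.

```python
import ipaddress
import re
from collections import Counter

# Assumption: "combined" log format. Adjust the pattern if your format differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

BLOCKED_NET = ipaddress.ip_network("80.82.139.128/27")  # range named in the post
AGENT_MARKER = "magpie-crawler"                         # string named in the post


def summarise(lines, marker=AGENT_MARKER, network=BLOCKED_NET):
    """Summarise hits from one user agent in an iterable of log lines.

    Returns the total hit count, the busiest one-second timestamp with its
    hit count, and the client IPs seen inside and outside `network`.
    """
    per_second = Counter()
    inside, outside = set(), set()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or marker not in match.group("agent"):
            continue
        # Combined-format timestamps have one-second resolution, so counting
        # identical timestamps gives hits per second.
        per_second[match.group("ts")] += 1
        try:
            addr = ipaddress.ip_address(match.group("ip"))
        except ValueError:
            continue  # the log recorded a hostname rather than an address
        (inside if addr in network else outside).add(str(addr))
    peak = per_second.most_common(1)[0] if per_second else None
    return {
        "total": sum(per_second.values()),
        "peak": peak,
        "inside": inside,
        "outside": outside,
    }
```

Something like `summarise(open("access.log"))` (the path is a placeholder) would then give a quick picture. Any addresses turning up in `outside` suggest the crawler is also coming from a range you have not yet blocked.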

7 comments

Comment from: Fabrice [Visitor]
Hello Jan,

Sorry if our crawler has been a problem. I checked our system and we do indeed have your blog's feed in our records, but the crawler shouldn't have visited it more than twice a day, and it only looks for new pages. Maybe you're referring to a different site?

As you said the purpose of our service is to highlight what people are saying about our clients and their competitors. Our clients in turn quite often try to engage with authors to clarify issues, run advertising campaigns, etc.

Regards
Fabrice
29/10/08 @ 19:39
Comment from: ajdonnison [Member]
A few things, Fabrice.

I am not Jan. It was a reference to a well known marketing campaign in Australia.

I never mentioned which of my sites was the target of the crawler. I happen to run quite a few on that particular server. As it happens 178 hits at the same time on one site (austcrimefiction.org) is hardly "twice a day". But it doesn't matter, I have your IP block banned and it will stay that way.

It doesn't matter how you try and justify what you do, as it happens you take my intellectual property without my knowledge or consent to make a profit. That at the very least doesn't sound fair, and if not a breach in law of the copyright act is certainly a breach in intent.
29/10/08 @ 21:27
Comment from: Fabrice [Visitor]
Hi again,

Thanks for pointing out the domain name, we're going to look into what happened there. It's clearly not in our interests to annoy people, so we always look into this kind of issue. To date (in more than a year) we have had only two complaints in total.

Cheers
Fabrice
30/10/08 @ 00:09
Comment from: ajdonnison [Member]
Fabrice, this has been going on for some weeks. As I mentioned before, I never suspected a crawler was responsible for the denial of service, and that may be a common problem. You may want to check your logs for the 19th of October onwards, as this is where the crawler is most virulent.

I'm not particularly interested in the number of complaints you've had. It is an irrelevant metric. It doesn't change the fact that I've had to protect my site from your crawler. I have logs showing the time, number of hits, and the correlation with server load. I'd suggest you stop trying to manage perceptions of the problem and just fix it.
30/10/08 @ 09:26
Comment from: Thomas [Visitor]
The exact same problem. Their entire range 94.28.34.192/26 has been blocked from all of my servers. Leeches!
26/12/10 @ 23:34
Comment from: Andrew [Visitor]
Thanks for the tip, their "crawler" still misbehaves.
12/06/11 @ 01:06
Comment from: SergeiS [Visitor]
I also had to block access via .htaccess
21/06/11 @ 20:09
