On badly behaved crawlers

For the last few weeks my main system has had to be rebooted daily, and sometimes twice a day. Having other things on my mind I didn't bother getting too fussed, thinking perhaps I had some dodgy RAM or a disk on the way out - both things I could live with until they finally became terminal. As I looked more closely at the pattern of failure, though, it began to resemble a deliberate denial of service attack on the site. So I began active monitoring as well as log analysis.

Now I have no problem with search engines crawling my site. Mostly (apart from some well known exceptions) they behave themselves, abide by the robots.txt rules, and don't flood the system with requests. Above all, they perform a useful function for both me and my readers. What I do have a problem with is a crawler that I have identified as the cause of the system failures.

Brandwatch is apparently involved in "Online Reputation Management and Brand Tracking in Social Media". They have a crawler that will just swamp a reasonably small site. No pacing, no maximum number of requests per second, just blast away at the fastest rate they can, and if you happen to be the target then you'd better be ready for a connection storm. If you are having problems and see the string 'magpie-crawler' in your logs, then you may well be one of their victims. Worse, they perform no useful function for me or my readers. They sell information to their clients on who is talking about them. That's right, they are trawling my site in order to sell information to someone who is worried I might be saying something negative. Well, I am. Brandwatch is a menace. And they need to get their crawler fixed. Until they do, I have blocked 80.82.139.128/27 from my servers, and I'd urge others to do the same until they wake up. What is particularly galling is that on their blog they have a piece on how bad Google Search is. Sheesh, at least Google's crawlers are well behaved and don't interfere with the running of the site.
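For anyone wanting to check their own logs, here is a minimal Python sketch of the kind of analysis described above. It is not the tooling used on this server. It assumes an Apache/nginx style "combined" access log, counts requests per second from any agent whose string contains 'magpie-crawler', and tests whether an address falls inside 80.82.139.128/27. The function names and the log format are assumptions for illustration only.

```python
import ipaddress
import re
from collections import Counter

# Assumption: "combined" log format, e.g.
# IP - - [timestamp] "request" status size "referer" "user agent"
# Adjust the pattern if your server logs in a different layout.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

AGENT_MARKER = "magpie-crawler"                          # string named in the post
BLOCKED_NET = ipaddress.ip_network("80.82.139.128/27")   # range named in the post


def crawler_hits_per_second(lines, marker=AGENT_MARKER):
    """Count requests per log timestamp (one-second resolution) for one agent."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and marker in match.group("agent"):
            hits[match.group("ts")] += 1
    return hits


def in_blocked_range(ip, network=BLOCKED_NET):
    """Return True if ip parses as an address inside network."""
    try:
        return ipaddress.ip_address(ip) in network
    except ValueError:
        # Not a literal IP address (for example a resolved hostname).
        return False
```

Feeding the open log file to `crawler_hits_per_second` and calling `.most_common(10)` on the result lists the busiest seconds, which is usually enough to tell a paced crawler from a connection storm.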

Not happy, Jan!

* P.S. - No I haven't put in URLs. I don't want to advertise them any more than they deserve.

11 comments

Comment from: Fabrice [Visitor]  

Hello Jan,

Sorry if our crawler has been a problem. I checked our system and we do indeed have your blog’s feed in our records, but the crawler shouldn’t have visited it more than twice a day, and it only looks for new pages. Maybe you’re referring to a different site?

As you said the purpose of our service is to highlight what people are saying about our clients and their competitors. Our clients in turn quite often try to engage with authors to clarify issues, run advertising campaigns, etc.

Regards
Fabrice

29/10/08 @ 19:39
Comment from: aj [Member]

A few things, Fabrice.

I am not Jan. It was a reference to a well known marketing campaign in Australia.

I never mentioned which of my sites was the target of the crawler. I happen to run quite a few on that particular server. As it happens, 178 hits at the same time on one site (austcrimefiction.org) is hardly “twice a day”. But it doesn’t matter, I have your IP block banned and it will stay that way.

It doesn’t matter how you try to justify what you do: you take my intellectual property without my knowledge or consent to make a profit. That at the very least doesn’t sound fair, and if it is not a breach of the copyright act in law, it is certainly a breach in intent.

29/10/08 @ 21:27
Comment from: Fabrice [Visitor]  

Hi again,

Thanks for pointing out the domain name, we’re going to look into what happened there - it’s clearly not in our interests to annoy people, so we always look into this kind of issue. To date (in more than a year) we have had only two complaints in total.

Cheers
Fabrice

30/10/08 @ 00:09
Comment from: aj [Member]

Fabrice, this has been going on for some weeks. As I mentioned before, I never suspected a crawler was responsible for the denial of service, and that may be a common problem. You may want to check your logs from the 19th of October onwards, as this is when the crawler was at its most virulent.

I’m not particularly interested in the number of complaints you’ve had. It is an irrelevant metric. It doesn’t change the fact that I’ve had to protect my site from your crawler. I have logs showing the time, number of hits, and the correlation with server load. I’d suggest you stop trying to manage perceptions of the problem and just fix it.

30/10/08 @ 09:26
Comment from: Thomas [Visitor]

The exact same problem. Their entire range 94.28.34.192/26 has been blocked from all of my servers. Leeches!

26/12/10 @ 23:34
Comment from: Andrew [Visitor]

Thanks for the tip, their “crawler” still misbehaves.

12/06/11 @ 01:06
Comment from: SergeiS [Visitor]  

I also had to block access via .htaccess

21/06/11 @ 20:09
Comment from: Dave [Visitor]  

They’re being a nuisance for me as well. Whole subnet blocked now, for all servers.

15/03/12 @ 21:00
Comment from: Neil [Visitor]

Hi all

The magpie crawler is also causing us some nasty issues on several large sites. The main issue for us is in fact that they persist in crawling non-existent pages, one of which alone is causing us over 20,000 404s per day.

We are taking steps now to ban magpie from our servers.

Cheers
Neil

24/08/12 @ 17:51
Comment from: Neil [Visitor]

Just an update… We applied the robots.txt fix and noticed a distinct drop in origin traffic and CPU usage.

Not all magpie-crawler connections have gone but most have. We will only block the IPs as the last resort but it may come to that if we can’t get rid of them all via robots.txt.

Seems magpie is (as stated above) simply overly aggressive.
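The exact robots.txt rules are not given above, so what follows is only a sketch of one plausible form, assuming the crawler identifies itself with the 'magpie-crawler' token. It uses Python's standard urllib.robotparser to confirm that a rule set shuts that agent out while leaving other agents alone. The rule text and the helper name is_allowed are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: shut out one agent entirely, allow everyone else.
ROBOTS_TXT = """\
User-agent: magpie-crawler
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_text, agent, path):
    """Return True if robots_text permits agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)
```

A check like this only confirms what the file asks for. robots.txt is advisory, so a crawler that ignores it still has to be stopped at the server or firewall.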

10/09/12 @ 19:36
Comment from: Steve [Visitor]

Just to mention that we are now in 2014 and magpie still hasn’t improved. My advice: ban it from your site right away. I get about 3 - 5 hits a second from that piece of junk when they decide to index my site and I’m tired of paying for their incompetence.

01/12/14 @ 19:21