For the last few weeks my main system has had to be rebooted daily, and sometimes twice a day. Having other things on my mind I didn't bother getting too fussed, thinking perhaps I had some dodgy RAM or a disk on the way out - both things I could live with until they finally became terminal. Once I looked at the pattern of failure though, it began to resemble a deliberate denial of service attack on the site. So I began active monitoring as well as log analysis.
Now I have no problem with search engines crawling my site. Mostly (apart from some well known exceptions) they behave themselves, abide by the robots.txt rules, and don't flood the system with requests. Above all, they perform a useful function to both me and my readers. I do have a problem with a crawler that I have identified as the cause of the system failures.
Brandwatch is apparently involved in "Online Reputation Management and Brand Tracking in Social Media". They have a crawler that will just swamp a reasonably small site. No pacing, no maximum number of requests per second, just blast away at the fastest rate they can, and if you happen to be the target then you'd better be ready for a connection storm. If you are having problems and see the string 'magpie-crawler' in your logs, then you may well be one of their victims. Worse, they perform no useful function for me or my readers. They sell information to their clients on who is talking about them. That's right, they are trawling my site in order to sell information to someone who is worried I might be saying something negative. Well, I am. Brandwatch is a menace. And they need to get their crawler fixed. Until they do I have blocked 22.214.171.124/27 from my servers and I'd urge others to do the same until they wake up. What is particularly galling is that on their blog they have a piece on how bad Google Search is. Sheesh, at least Google's crawlers are well behaved and don't interfere with the running of the site.
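For anyone wanting to check their own logs, here is a minimal sketch of the sort of log analysis I mean. It assumes an Apache-style combined access log where each line starts with the client IP and a bracketed timestamp; the function name and the regular expression are my own illustration, not anything Brandwatch or your web server provides, so adjust them to whatever your logs actually look like.

```python
import re
from collections import Counter

# Assumed layout (combined log format): client IP, two fields we ignore,
# then the timestamp in square brackets, which has one-second resolution.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def crawler_hits_per_second(lines, marker="magpie-crawler"):
    """Count requests per (client IP, timestamp) for lines containing marker.

    `lines` is any iterable of log lines, such as an open file.
    Lines that do not match the assumed layout are skipped.
    """
    hits = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            hits[(match.group(1), match.group(2))] += 1
    return hits
```

Pass it an open log file and look at `hits.most_common(10)`: a well-paced crawler should show small counts for any one second, while a connection storm shows up as the same address appearing many times against a single timestamp.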
Not happy, Jan!
* P.S. - No, I haven't put in URLs. I don't want to advertise them any more than they deserve.