Ru -
Joined: 10 Apr 2007 Posts: 10
Posted: Tue Nov 05, 2013 6:47 pm Post subject: Disabling search engines from crawling the site
I am getting tons of log entries from search engine crawlers, e.g. Baidu, Yahoo, Bing, and so on. Is there any way to filter them out and leave just Google, for example?
Axis -
Joined: 29 Sep 2003 Posts: 336
Posted: Wed Nov 06, 2013 6:21 pm Post subject:
Baidu, Yahoo and Bing will obey the robots.txt protocol. But they are just the tip of the iceberg. There is a never-ending parade of new or "special" bots, and you will find, if you don't already know, that the three listed here are the least problematic. More of a problem are bots "probing" for vulnerabilities.
Here is part of one of my robots.txt files:
User-agent: AhrefsBot
Disallow: /
User-agent: sistrix
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: MJ12bot
Disallow: /
Here is another site's robots.txt that only allows Google, Yahoo and Bing (Bing will still respond to Msnbot, though its new user agent is Bingbot):
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow:
User-agent: Yahoo-slurp
Disallow:
User-agent: Msnbot
Disallow:
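If you want to sanity-check a whitelist-style robots.txt like the one above, Python's standard library ships a compliant parser. This is just a quick sketch with the rules pasted inline rather than fetched from a server; the bot names are examples:

```python
# Check how a compliant crawler would interpret a whitelist-style robots.txt,
# using Python's standard urllib.robotparser module.
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Googlebot matches its own group (empty Disallow = everything allowed);
# any other user agent falls through to the "*" group and is blocked.
print(rp.can_fetch("Googlebot", "/index.html"))     # True
print(rp.can_fetch("SomeOtherBot", "/index.html"))  # False
```

Of course, this only tells you what a *well-behaved* crawler will do; rogue bots ignore robots.txt entirely.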
In the end, it is the rogue robots that cause the most trouble, and it is nearly impossible to stop them unless you can confine them to an IP range.
I managed to stop the Ezooms hacker robot by banning 208.115.111.64/28 and 208.115.96.0/19. But, like I said, that is just the tip of the iceberg.
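On an Apache site, bans like those would look something like the fragment below (a hypothetical Apache 2.2-style .htaccess sketch; Abyss does not read .htaccess, so there you would use the server's own IP-blocking configuration instead):

```apache
# Hypothetical .htaccess fragment banning the Ezooms ranges (Apache 2.2 syntax)
Order Allow,Deny
Allow from all
Deny from 208.115.111.64/28
Deny from 208.115.96.0/19
```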
[edit] There is a wonderful Perl script called "Guardian", designed for Apache, that is on the surface an error handler but "underneath" lets you add strings that show up in 404s for things that are obviously probes looking for a door into your website, and then write a deny rule for the offending IP address to .htaccess. Unfortunately, as Abyss does not use .htaccess, it won't work on it.
If I were smart enough perhaps it could be hacked to write to persist.data, but I am, alas, not that smart:
[EDITED BY ADMIN - OLD LINK THAT NOW REDIRECTS TO A FINANCIAL COMPANY - USED TO BE A SOFTWARE COMPANY] xav.com /scripts/guardian/
Guardian is an invaluable tool on my paid-for Apache site. Right now I have over 350 banned IPs on it. I usually "clean it out" at around 500.
Example of Guardian: since I recoded the above Apache site to HTML5, hackers assume I am using WordPress or some other CMS. I write a string to Guardian's "Filter Rules" like this:
==
#php losers
url-substring: wp-login.php
blacklist: /home/myuserid/public_html/.htaccess
==
and bang, they're banned. Same with domainname.com/index.php, etc.
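Guardian itself is Perl and writes to .htaccess, but the core idea, watching for requests that hit known probe URLs and collecting the offending IPs, is easy to sketch. Here is a hypothetical Python version; the log format, probe list, and all names here are my own assumptions, not Guardian's actual code:

```python
# Guardian-style sketch: scan common-format access log lines for 404s on
# known probe URLs and emit Apache 2.2-style "Deny from" lines.
import re

# Substrings that only show up when someone is probing for a CMS login.
PROBE_SUBSTRINGS = ["wp-login.php", "xmlrpc.php", "phpmyadmin"]

# ip - - [timestamp] "METHOD url ..." status
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+)[^"]*" (\d{3})')

def probing_ips(log_lines):
    """Return the set of client IPs that requested a probe URL and got a 404."""
    banned = set()
    for line in log_lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, url, status = m.groups()
        if status == "404" and any(s in url for s in PROBE_SUBSTRINGS):
            banned.add(ip)
    return banned

sample = [
    '10.0.0.1 - - [05/Nov/2013:18:47:00 +0000] "GET /wp-login.php HTTP/1.1" 404 209',
    '10.0.0.2 - - [05/Nov/2013:18:48:00 +0000] "GET /index.html HTTP/1.1" 200 512',
]
for ip in sorted(probing_ips(sample)):
    print("Deny from " + ip)  # prints: Deny from 10.0.0.1
```

A script like this could append its output to whatever ban list the server actually honors, which is the part that would need an Abyss-specific answer.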
Regards,
Axis