Disabling search engines from crawling the site

 
Ru
Joined: 10 Apr 2007
Posts: 10

Posted: Tue Nov 05, 2013 6:47 pm    Post subject: Disabling search engines from crawling the site

I am getting tons of log entries that come from search engines, e.g. Baidu, Yahoo, Bing, and so on. Is there any way to filter them out, leaving just Google, for example?
Axis
Joined: 29 Sep 2003
Posts: 336

Posted: Wed Nov 06, 2013 6:21 pm    Post subject:

Baidu, Yahoo and Bing will obey the robots.txt protocol, but they are just the tip of the iceberg. There is a never-ending parade of new or "special" bots, and you will find, if you don't already know, that the three listed here are the least problematic. More of a problem are bots "probing" for vulnerabilities.

Here is part of one of my robots.txt files:
User-agent: AhrefsBot
Disallow: /

User-agent: sistrix
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: MJ12bot
Disallow: /

Here is another site's robots.txt that allows only Google, Yahoo and Bing (Bing still responds to Msnbot, though its new user agent is Bingbot):
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Yahoo-slurp
Disallow:

User-agent: Msnbot
Disallow:
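
A quick way to sanity-check rules like these before deploying them is to feed them to Python's standard `urllib.robotparser`. This is only a minimal sketch: it shows how Python's parser reads the file, which is not necessarily how every crawler interprets it, and `Baiduspider` and `AhrefsBot` are crawler names picked here purely for illustration. The helper name `allowed_agents` is made up for this example.

```python
from urllib.robotparser import RobotFileParser

# The allowlist rules quoted above, reproduced verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: Yahoo-slurp
Disallow:

User-agent: Msnbot
Disallow:
"""


def allowed_agents(robots_txt, agents, path="/"):
    """Return {agent: bool} saying whether each agent may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


if __name__ == "__main__":
    # Example crawler names; only Googlebot and Msnbot appear in the rules.
    agents = ["Googlebot", "Msnbot", "Baiduspider", "AhrefsBot"]
    for agent, ok in allowed_agents(ROBOTS_TXT, agents).items():
        print(f"{agent:12} {'allowed' if ok else 'blocked'}")
```

An empty `Disallow:` means "nothing is disallowed", so the named bots get in while everything else falls through to the `*` entry.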

In the end, it is the rogue robots that cause the most trouble, and it is nearly impossible to stop them unless you can confine them to an IP range.

I managed to stop the Ezooms hacker robot
by banning 208.115.111.64/28
and 208.115.96.0/19
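
To see what those CIDR ranges actually cover, or to check whether an address from the logs falls inside them, something like the following sketch with Python's standard `ipaddress` module should do. The function name `is_banned` is made up for this example; how the ban itself is enforced on the server is a separate matter.

```python
import ipaddress

# The two ranges mentioned above.
BANNED_RANGES = [
    ipaddress.ip_network("208.115.111.64/28"),
    ipaddress.ip_network("208.115.96.0/19"),
]


def is_banned(ip, ranges=BANNED_RANGES):
    """Return True if `ip` falls inside any of the banned networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ranges)


if __name__ == "__main__":
    for net in BANNED_RANGES:
        # net[0] and net[-1] are the first and last addresses of the range.
        print(f"{net}: {net[0]} - {net[-1]} ({net.num_addresses} addresses)")
    print(is_banned("208.115.111.70"))
    print(is_banned("208.115.128.1"))
```

As it happens, the /28 sits entirely inside the /19 (which runs from 208.115.96.0 to 208.115.127.255), so the second range alone covers both.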

But, like I said, it is the tip of the iceberg.

[edit] There is a wonderful Perl script called "Guardian", designed basically for Apache. On the surface it is an error handler, but "underneath" it lets you add strings that show up in 404s for requests that are obviously probes looking for a door into your website, and it writes a deny rule for the offending IP address to .htaccess. Unfortunately, as Abyss does not use .htaccess, it won't work on it.

If I were smart enough, perhaps it could be hacked to write to persist.data, but I am, alas, not that smart:
http://www.xav.com/scripts/guardian/
Guardian is an invaluable tool on my paid-for Apache site. Right now I have over 350 banned IPs on it. I usually "clean it out" at around 500.

Example of Guardian: Since I recoded the above Apache site to HTML5, hackers assume I am using WordPress or some other CMS. I add a rule to Guardian's "Filter Rules" like this:
==

#php losers
url-substring: wp-login.php
blacklist: /home/myuserid/public_html/.htaccess

==
and bang, they're banned. Same with domainname.com/index.php, etc.
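
For a server without .htaccess, the same idea can at least be approximated offline. The sketch below is not Guardian and does not touch any server configuration: it only scans access-log lines for 404s whose URL contains a probe substring and collects the client IPs. It assumes the log is in Common Log Format (adjust the regex if yours differs), and the names `PROBE_SUBSTRINGS`, `LOG_LINE` and `probing_ips` are made up for this example.

```python
import re

# Substrings that mark a request as a probe; wp-login.php is from the rule above.
PROBE_SUBSTRINGS = ["wp-login.php"]

# Assumed layout: Common Log Format, e.g.
# 203.0.113.9 - - [06/Nov/2013:18:21:00 +0100] "GET /wp-login.php HTTP/1.1" 404 512
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" (\d{3})')


def probing_ips(log_lines, substrings=PROBE_SUBSTRINGS):
    """Return the set of client IPs whose 404 requests contain a probe substring."""
    offenders = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        ip, url, status = match.groups()
        if status == "404" and any(s in url for s in substrings):
            offenders.add(ip)
    return offenders


if __name__ == "__main__":
    # Made-up sample lines using documentation-reserved addresses.
    sample = [
        '203.0.113.9 - - [06/Nov/2013:18:21:00 +0100] "GET /wp-login.php HTTP/1.1" 404 512',
        '198.51.100.7 - - [06/Nov/2013:18:22:10 +0100] "GET /index.html HTTP/1.1" 200 2048',
    ]
    print(sorted(probing_ips(sample)))
```

What to do with the resulting addresses (ban them by hand, or feed them to whatever the server offers) is left open here.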

Regards,
Axis
All times are GMT + 1 Hour