Information At A Glance

This blog is just for sharing various bits of information

Prevent Specific Spiders From Crawling Our Pages

Posted by Dhrub Raaj on August 5, 2010


There are three different ways of blocking spiders. Before we start, however, you’ll need some basic data in order to identify specific spiders reliably: mainly the User Agent header field (a.k.a. identifier) and, in the case of Copyscape, the spider’s originating IP address.
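If you first want to see which spiders are actually hitting your site, your access log will tell you. Assuming Apache’s combined log format and a Debian-style log path (adjust both to your own setup), this one-liner tallies the User Agents seen:

grep -i -E 'spider|bot' /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn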

Basic Spider Data: User Agents

Yandex (RU)
Russian search engine Yandex features the following User Agents:

Mozilla/5.0 (compatible; YandexBlogs/0.99; robot; B; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexBot/3.0; MirrorDetector; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexMedia/3.0; +http://yandex.com/bots)
YandexSomething/1.0

Goo (JP)
Japanese search engine Goo features the following User Agents:

DoCoMo/2.0 P900i(c100;TB;W24H11) (compatible; ichiro/mobile goo; +http://help.goo.ne.jp/help/article/1142/)
ichiro/2.0 (http://help.goo.ne.jp/door/crawler.html)
moget/2.0 (moget@goo.ne.jp)

Naver (KR)
Korean search engine Naver features the following User Agents:

Mozilla/4.0 (compatible; NaverBot/1.0; http://help.naver.com/customer_webtxt_02.jsp)

Baidu (CN)
China’s number-one search engine Baidu features the following User Agents:

Baiduspider+(+http://www.baidu.com/search/spider.htm)
Baiduspider+(+http://www.baidu.jp/spider/)

SoGou (CN)
Chinese search engine SoGou features the following User Agents:

Sogou Pic Spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou head spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou Orion spider/3.0( http://www.sogou.com/docs/help/webmasters.htm#07)
Sogou-Test-Spider/4.0 (compatible; MSIE 5.5; Windows 98)
sogou spider
Sogou Pic Agent

Youdao (CN)
Chinese search engine Youdao (which also spells itself “Yodao” on occasion) features the following User Agents:

Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/; )
Mozilla/5.0 (compatible;YodaoBot-Image/1.0;http://www.youdao.com/help/webmaster/spider/;)

Majestic-SEO
Link analysis service Majestic-SEO (http://www.majesticseo.com/) uses the distributed search engine Majestic-12:

Majestic-12
UA: Mozilla/5.0 (compatible; MJ12bot/v1.3.3; http://www.majestic12.co.uk/bot.php?+)

Copyscape
Copyscape Plagiarism Checker – Duplicate Content Detection Software
Site info: http://www.copyscape.com

User Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
IP: 212.100.254.105
Host: googlealert.com

Copyscape works in an underhanded manner: it hides its spider behind a generic User Agent and behind a domain name that falsely suggests a connection to Google, when in reality the domain belongs to Copyscape itself.

This means that you cannot identify this sneaky spider via the User Agent header field. The only reliable way to block it is via its IP.
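You can check the deceptive hostname yourself with a reverse DNS lookup (the host utility ships with most Unix-like systems; output may of course change over time):

host 212.100.254.105

At the time of writing, this resolved to googlealert.com, which, as noted above, belongs to Copyscape rather than Google.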

Blocking Spiders via robots.txt

For a general introduction to the robots.txt protocol, please see: http://www.robotstxt.org/

Search engines are expected to disclose which code to deploy in a robots.txt file to deny their spiders access to a site’s pages, and the page outlining this process should be easy to find.

Regrettably, most of the spiders listed above feature their robots.txt specs only in Chinese, Japanese, Russian, or Korean, which isn’t very helpful for your average English-speaking webmaster.

The following list features info links for webmasters and the code you should actually deploy to block specific spiders.

Yandex (RU)
Info: http://yandex.com/bots gives us no information on Yandex-specific robots.txt usage.

Required robots.txt code:

User-agent: Yandex
Disallow: /

Goo (JP)
Info (Japanese): http://help.goo.ne.jp/help/article/704/
Info (English): http://help.goo.ne.jp/help/article/853/

Required robots.txt code:

User-agent: moget
User-agent: ichiro
Disallow: /

Naver (KR)
Info: http://help.naver.com/customer/etc/webDocument02.nhn

Required robots.txt code:

User-agent: NaverBot
User-agent: Yeti
Disallow: /

Baidu (CN)
Info: http://www.baidu.com/search/spider.htm

Required robots.txt code:

User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: /

SoGou (CN)
Info: http://www.sogou.com/docs/help/webmasters.htm#07

Required robots.txt code:

User-agent: sogou spider
Disallow: /

Youdao (CN)
Info: http://www.youdao.com/help/webmaster/spider/

Required robots.txt code:

User-agent: YoudaoBot
Disallow: /
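Incidentally, you don’t need a separate robots.txt record per engine: the protocol allows several User-agent lines to share a single Disallow. A combined file covering all the cooperative spiders above might look like this (a sketch; trim it to the spiders you actually want gone — Majestic-12’s MJ12bot is included too, since its bot page linked above states that it honors robots.txt):

User-agent: Yandex
User-agent: moget
User-agent: ichiro
User-agent: NaverBot
User-agent: Yeti
User-agent: Baiduspider
User-agent: Baiduspider-video
User-agent: Baiduspider-image
User-agent: sogou spider
User-agent: YoudaoBot
User-agent: MJ12bot
Disallow: /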

Because the robots.txt protocol doesn’t allow for blocking by IP, you’ll have to resort to one of the two following methods to block Copyscape’s spider.

Blocking Spiders via .htaccess and mod_rewrite

Since not all spiders abide by the robots.txt protocol, it’s safer to block them via .htaccess and mod_rewrite on Apache systems.

Like robots.txt, the .htaccess file applies to single domains only. For a solution covering your entire Web server, please see the section on Apache’s httpd.conf below.

Here’s a simple example for blocking Baidu and Sogou spiders:

In your .htaccess file, include the following code:

RewriteEngine on
Options +FollowSymlinks
RewriteBase /

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou [NC]
RewriteRule ^.*$ - [F]

Explanation:

  1. The various User Agents to be blocked from access are listed one per line.
  2. The rewrite conditions are connected via “[OR]”; note that every condition except the last one carries this flag.
  3. “NC”: “no case”, i.e. case-insensitive matching.
  4. The caret “^” stipulates that the User Agent must start with the listed string (e.g. “Baiduspider”).
  5. “[F]” serves the spider a “403 Forbidden” response.

Thus, if you want to block Yandex spiders, for instance, you can use the following code:

RewriteCond %{HTTP_USER_AGENT} Yandex

In this particular case the block is triggered whenever the string “Yandex” occurs anywhere in the User Agent identifier, because without the caret the match isn’t anchored to the start.

As mentioned above, Copyscape can only be blocked via its IP. The specific code is:

RewriteCond %{REMOTE_ADDR} ^212\.100\.254\.105$

Note the backslash-escaped dots: the condition is a regular expression, and unescaped dots would match any character.
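Putting it all together, a single .htaccess ruleset can combine the User Agent and IP blocks. Here’s a sketch (extend the condition list with whichever spiders you want to ban; remember that every condition except the last carries the [OR] flag):

RewriteEngine on
Options +FollowSymlinks
RewriteBase /

RewriteCond %{HTTP_USER_AGENT} ^Baiduspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Sogou [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{REMOTE_ADDR} ^212\.100\.254\.105$
RewriteRule ^.*$ - [F]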

Blocking Spiders via the Apache Configuration File httpd.conf

An alternative method of blocking spiders can be executed from the Apache webserver configuration file by listing the pertinent User Agent header fields there. The main advantage of this approach is that it will apply to the entire server (i.e., it’s not limited to single domains). This can save you lots of time and effort, provided you actually wish to apply these spider blocks uniformly across your entire system.

Include your new directives in the following section of Apache’s httpd.conf file:

# This should be changed to whatever you set DocumentRoot to.
#
<Directory "/var/www/html">
    SetEnvIfNoCase User-Agent "^Baiduspider" bad_bots
    SetEnvIfNoCase User-Agent "^Sogou" bad_bots
    SetEnvIf Remote_Addr "212\.100\.254\.105" bad_bots

    Order allow,deny
    Allow from all
    Deny from env=bad_bots
</Directory>
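Whichever method you choose, you can test the block by spoofing a spider’s User Agent with curl (replace www.example.com with your own domain). A working block returns a 403:

curl -I -A 'Baiduspider+(+http://www.baidu.com/search/spider.htm)' http://www.example.com/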

By Ralph Tegtmeier, S.E. Watch
