Setting Block Crawlers - avoid duplicate content (SEO)

Introduction

For better Search Engine Optimization (SEO), your content should be unique on the Internet. Please note that the following URLs

https://www.mycompany.com
https://mycompany.com

are considered different. They should either return different content, or one should redirect to the other, to preserve the uniqueness of the content. A permanent redirect (301) is the right way to tell search engines that two different URLs lead to the same web page.
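
The redirect decision can be sketched as follows. This is a minimal illustration that assumes www.mycompany.com was chosen as the canonical host; the names CANONICAL_HOST and canonical_redirect are hypothetical and not part of any CDN or web server API. In practice the redirect is usually configured in the web server itself.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the www variant is the one you want search engines to index.
CANONICAL_HOST = "www.mycompany.com"


def canonical_redirect(url, canonical_host=CANONICAL_HOST):
    """Return (301, target_url) if url uses a non-canonical host, else None.

    Any explicit port in the original URL is dropped in this sketch.
    """
    parts = urlsplit(url)
    if parts.hostname is None or parts.hostname == canonical_host:
        return None
    target = urlunsplit(
        (parts.scheme, canonical_host, parts.path or "/", parts.query, parts.fragment)
    )
    return 301, target


print(canonical_redirect("https://mycompany.com/about?x=1"))
print(canonical_redirect("https://www.mycompany.com/about?x=1"))
```

The first call yields a permanent redirect to the www host, while the second returns None because the URL is already canonical.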

Let's assume that you have created a CDN service with Origin Domain www.mycompany.com and Service Domain static.mycompany.com. Your HTML pages are then accessible via both domains, www.mycompany.com and static.mycompany.com. This is bad for your SEO because it leads to duplicate content.

Block Crawlers option

The Block Crawlers option of both the CDN Static and CDN Static Push services allows you to block search engine crawlers (also referred to as bots) from indexing your CDN content.

How it works

When you enable the Block Crawlers option, a new robots.txt file will automatically appear at the following CDN URL.

https://static.mycompany.com/robots.txt

It has the following content.

User-agent: *  
Disallow: /

This tells all search engine bots that honor robots.txt not to crawl, and therefore not to index, your CDN content.

Allowing (for example) Googlebot Images

Please note that when you enable the Block Crawlers option, none of your CDN URLs will be indexed by search engine bots. In particular, if you use the CDN for your images, they will not be indexed by image search engine bots. In most cases this is nothing to worry about, but if you need to allow (for example) Googlebot Images to index your CDN images, please follow these steps.

Introduction

We are going to update your web server configuration for www.mycompany.com so that it responds to CDN requests with a different robots.txt than to non-CDN requests.

Create a CDN robots.txt file

Create the file /DocumentRoot/robots-cdn.txt with the following content.

User-agent: *
Disallow: /

User-agent: Googlebot-Image
Allow: /
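
To see how a robots.txt parser reads these rules, here is a small sketch using Python's standard urllib.robotparser module. It is only one parser implementation, so real crawlers may differ in detail; the helper name allowed_agents and the image URL are hypothetical and chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Content of robots-cdn.txt as given above.
CDN_ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot-Image
Allow: /
"""

# Hypothetical image URL on the CDN Service Domain.
IMAGE_URL = "https://static.mycompany.com/images/logo.png"


def allowed_agents(robots_txt, agents, url):
    """Return the agents from `agents` that may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


print(allowed_agents(CDN_ROBOTS_TXT, ["Googlebot", "Bingbot", "Googlebot-Image"], IMAGE_URL))
```

Under this parser only Googlebot-Image is reported as allowed, because its dedicated group takes precedence over the catch-all group.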

Update configuration on your origin server

On the Services/Settings page you can find your Service Identifier. It has the format NUMBER.r.cdnsun.net, and you will need that NUMBER in this step. Add the following to your web server configuration for www.mycompany.com.

Apache virtual host

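RewriteEngine On
# Flag requests whose X-Resource header equals your CDN service number.
SetEnvIf X-Resource "^NUMBER$" IS_CDNSUN_REQUEST="yes"
# Serve robots-cdn.txt in place of robots.txt for flagged requests only.
RewriteCond %{ENV:IS_CDNSUN_REQUEST} "yes"
RewriteRule "^/robots\.txt$" "/robots-cdn.txt" [L]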

Apache .htaccess

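RewriteEngine On
# Flag requests whose X-Resource header equals your CDN service number.
SetEnvIf X-Resource "^NUMBER$" IS_CDNSUN_REQUEST="yes"
# Serve robots-cdn.txt in place of robots.txt for flagged requests only.
RewriteCond %{ENV:IS_CDNSUN_REQUEST} "yes"
RewriteRule "^robots\.txt$" "robots-cdn.txt" [L]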

Nginx

location = /robots.txt {
    # Serve the CDN variant only when the request carries your service number.
    if ($http_x_resource = "NUMBER") {
        rewrite "/robots.txt" "/robots-cdn.txt" break;
    }
}

Please replace NUMBER with the CDN service number obtained from its Service Identifier.

Restart or reload your web server if necessary.
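
All three configurations implement the same rule, which can be stated as a small function. This is only an illustrative sketch of the logic, not code that runs on your server; the function name select_robots_file and the service number 12345 are hypothetical.

```python
def select_robots_file(headers, service_number):
    """Return the path of the robots file the origin should serve.

    `headers` is a mapping of request header names to values.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    if normalized.get("x-resource") == service_number:
        # The request comes from the CDN service: serve the CDN variant.
        return "/robots-cdn.txt"
    # Any other request gets the regular file.
    return "/robots.txt"


print(select_robots_file({"X-Resource": "12345"}, "12345"))
print(select_robots_file({}, "12345"))
```

The first call selects /robots-cdn.txt, the second falls back to /robots.txt.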

Test your origin server

You can use the curl command-line tool to test your origin web server. You should see the CDN robots.txt if and only if you include the X-Resource HTTP header in your request.

Request "normal" robots.txt

curl https://www.mycompany.com/robots.txt

Request CDN robots.txt

curl --header 'X-Resource: NUMBER' https://www.mycompany.com/robots.txt

Please replace NUMBER with the CDN service number obtained from its Service Identifier.
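
If you prefer a script to running curl by hand, the same two requests can be sketched with Python's standard urllib.request module. The helper names build_robots_request and fetch_robots are hypothetical, and NUMBER must be replaced as described above.

```python
import urllib.request


def build_robots_request(robots_url, service_number=None):
    """Build a GET request for robots.txt, optionally marked as a CDN request."""
    request = urllib.request.Request(robots_url)
    if service_number is not None:
        # Mimic the header that the origin configuration checks for.
        request.add_header("X-Resource", service_number)
    return request


def fetch_robots(robots_url, service_number=None, timeout=10):
    """Fetch robots.txt and return its body as text."""
    request = build_robots_request(robots_url, service_number)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Example usage (performs real HTTP requests, so it is left commented out):
# normal = fetch_robots("https://www.mycompany.com/robots.txt")
# cdn = fetch_robots("https://www.mycompany.com/robots.txt", "NUMBER")
# print("Origin distinguishes CDN requests:", normal != cdn)
```

Comparing the two bodies is only a rough check; it assumes that your regular robots.txt differs from robots-cdn.txt.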

Disable Block Crawlers

On the Services/Settings page, make sure that the Block Crawlers option is set to Disabled. If it is enabled, disable it and wait for the new CDN service settings to propagate. Then check your new robots.txt at the following CDN URL.

https://static.mycompany.com/robots.txt

Notes

If you do not see the new content, you might need to purge your old robots.txt file from the CDN cache.
Please note that the above steps for Googlebot Images apply analogously to all other search engine bots.
