Introduction
For better Search Engine Optimization (SEO) your content should be unique on the Internet. Please note that all of the following URLs
- http://www.mycompany.com
- http://mycompany.com
- https://www.mycompany.com
- https://mycompany.com
are considered different. In particular, they should either return different content or be redirected, to preserve the uniqueness of the content. A permanent redirect (301) is the right way to tell search engines that two different URLs lead to the same web page.
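For example, assuming an Apache server with mod_rewrite enabled, a minimal sketch of a 301 redirect from the non-www domain to the www domain might look like the following (the domain names are placeholders for your own):

```apache
# Hypothetical sketch: permanently redirect mycompany.com/... to www.mycompany.com/...
RewriteEngine On
RewriteCond %{HTTP_HOST} ^mycompany\.com$ [NC]
RewriteRule ^(.*)$ https://www.mycompany.com/$1 [R=301,L]
```

With this rule in place, search engines treat www.mycompany.com as the single canonical domain.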
Let's assume that you have created a CDN service with the Origin Domain www.mycompany.com and the Service Domain static.mycompany.com. Your HTML pages are then accessible via both domains, www.mycompany.com and static.mycompany.com. This is bad for your SEO as it leads to duplicate content.
Block Crawlers option
The Block Crawlers option of both the CDN Static and CDN Static Push services allows you to block search engine crawlers (also referred to as bots) from indexing your CDN content.
How it works
When you enable the Block Crawlers option, a new robots.txt file automatically appears at the following CDN URL.
http[s]://static.mycompany.com/robots.txt
It has the following content.
User-agent: *
Disallow: /
This ensures that all search engine bots are blocked from indexing your CDN content.
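As a quick sanity check, you can feed this robots.txt content to Python's standard-library robots.txt parser and confirm that crawling is denied for any bot (the user agent and URL below are arbitrary examples):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content served by the CDN when Block Crawlers is enabled
lines = ["User-agent: *", "Disallow: /"]

parser = RobotFileParser()
parser.parse(lines)

# Any bot, any CDN URL: fetching is disallowed
print(parser.can_fetch("Googlebot", "https://static.mycompany.com/logo.png"))  # False
```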
Allowing (for example) Googlebot Images
Please note that when the Block Crawlers option is enabled, none of your CDN URLs will be indexed by search engine bots. In particular, if you serve your images via the CDN, they will not be indexed by image search engine bots. In most cases this is nothing to worry about, but if you need to allow (for example) Googlebot Images to index your CDN images, please follow these steps.
Introduction
We are going to update your web server configuration for www.mycompany.com so that it responds with a different robots.txt to CDN requests than to non-CDN requests.
Create a CDN robots.txt file
Create a /DocumentRoot/robots-cdn.txt with the following content.
User-agent: *
Disallow: /

User-agent: Googlebot-Image
Allow: /
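Again as a sketch, Python's standard-library parser (a simplified model of real crawler behavior) can verify that these rules block ordinary crawlers while letting Googlebot-Image through:

```python
from urllib.robotparser import RobotFileParser

# Contents of robots-cdn.txt: block everything except Googlebot-Image
lines = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Googlebot-Image",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(lines)

print(parser.can_fetch("Googlebot", "https://static.mycompany.com/logo.png"))        # False
print(parser.can_fetch("Googlebot-Image", "https://static.mycompany.com/logo.png"))  # True
```

Real crawlers such as Googlebot obey the most specific matching User-agent group, so Googlebot-Image follows the Allow rule while all other bots fall back to the blanket Disallow.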
Update configuration on your origin server
On the Services/Settings page you can find your Service Identifier; it has the format NUMBER.r.cdnsun.net, and you will need that NUMBER in this step. Add the following to your web server configuration for www.mycompany.com.
Apache virtual host
RewriteEngine On
SetEnvIf X-Resource "NUMBER" IS_CDNSUN_REQUEST="yes"
RewriteCond %{ENV:IS_CDNSUN_REQUEST} "yes"
RewriteRule "/robots.txt" "/robots-cdn.txt" [L]
Apache .htaccess
RewriteEngine On
SetEnvIf X-Resource "NUMBER" IS_CDNSUN_REQUEST="yes"
RewriteCond %{ENV:IS_CDNSUN_REQUEST} "yes"
RewriteRule "robots.txt" "robots-cdn.txt" [L]
Nginx
location = "/robots.txt" {
    if ($http_x_resource = "NUMBER") {
        rewrite "/robots.txt" "/robots-cdn.txt" break;
    }
}
Please replace NUMBER with the CDN service number obtained from its Service Identifier.
Restart or reload your web server if necessary.
Test your origin server
You can use the command-line curl tool to test your origin web server. You should see the CDN robots.txt if and only if you include the X-Resource HTTP header in your request.
Request "normal" robots.txt
curl http://www.mycompany.com/robots.txt
Request CDN robots.txt
curl --header 'X-Resource: NUMBER' http://www.mycompany.com/robots.txt
Please replace NUMBER with the CDN service number obtained from its Service Identifier.
Disable Block Crawlers
On the Services/Settings page, make sure that the Block Crawlers option is set to Disabled. If it is enabled, disable it, wait for the new CDN service settings to propagate, and then check your new robots.txt at the following CDN URL.
http[s]://static.mycompany.com/robots.txt
Notes
- If you don't see the new content, you might need to purge your old robots.txt file from the CDN cache.
- Please note that the above steps for Googlebot Images apply analogously to all other search engine bots.