Robots.txt and SEO
Robots.txt for a WordPress website
Robots.txt – General information
Robots.txt is a text file located in the site’s root directory that tells search engines’ crawlers and spiders which website pages and files you do or don’t want them to visit. Site owners usually want to be noticed by search engines, but there are cases when it’s not needed: for instance, if you store sensitive data, or if you want to save bandwidth by excluding heavy image-laden pages from indexation.
When a crawler accesses a site, it requests a file named ‘/robots.txt’ first. If such a file is found, the crawler checks it for the website indexation instructions.
NOTE: there can be only one robots.txt file per website. A robots.txt file for an addon domain needs to be placed in the corresponding document root.
Google’s official stance on the robots.txt file
A robots.txt file consists of records, each containing two fields: a line with a user-agent name (the search engine crawler the record applies to) and one or several lines starting with the directive
Disallow:
Robots.txt has to be created in UNIX text format.
Basics of robots.txt syntax
Usually robots.txt file contains something like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
In this example, three directories (‘/cgi-bin/’, ‘/tmp/’ and ‘/~different/’) are excluded from indexation.
NOTE: every directory is written on a separate line. You can’t write ‘Disallow: /cgi-bin/ /tmp/’ on one line, nor can you split a single Disallow or User-agent directive across several lines – use a new line for each directive.
The ‘star’ (*) in the User-agent field means ‘any web crawler’. It does not work as a general wildcard, however: directives such as ‘Disallow: *.gif’ or ‘User-agent: Mozilla*’ are not part of the original standard (although some crawlers, such as Googlebot, do support wildcards in Disallow paths). Please pay attention to such logical mistakes, as they are among the most common ones.
Other common mistakes are typos: misspelled directories and user-agents, missing colons after User-agent and Disallow, and so on. As your robots.txt file gets more and more complicated, it’s easy for an error to slip in, so a validation tool comes in handy, for example: http://tool.motoricerca.info/robots-checker.phtml
Examples of usage
Here are some useful examples of robots.txt usage:
Prevent the whole site from indexation by all web crawlers:
User-agent: *
Disallow: /
Allow all web crawlers to index the whole site:
User-agent: *
Disallow:
Prevent only certain directories from indexation:
User-agent: *
Disallow: /cgi-bin/
Prevent the site’s indexation by a specific web crawler:
User-agent: Bot1
Disallow: /
Allow indexation for a specific web crawler and prevent it for all others:
User-agent: Opera 9
Disallow:
User-agent: *
Disallow: /
Prevent all the files from indexation except a single one.
This is a bit awkward, because the original robots.txt standard does not define an ‘Allow’ directive. One workaround is to move all the files you want to block into a single subdirectory and prevent its indexation, keeping the one file you want indexed outside of that directory:
User-agent: *
Disallow: /docs/
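Most major crawlers, including Googlebot and Bingbot, do also support an ‘Allow’ directive, so the same goal can be expressed directly. The file name below is just a placeholder, so adjust it to your own page:
User-agent: *
Allow: /docs/public.html
Disallow: /docs/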
You can also use an online robots.txt file generator.
Removing exclusion of images
The default robots.txt file in some CMS versions is set up to exclude your images folder. This issue doesn’t occur in the newest CMS versions, but older versions need to be checked.
This exclusion means your images will not be indexed or included in Google Image Search. Having them indexed is usually desirable, as it can improve your SEO rankings.
Should you want to change this, open your robots.txt file and remove the line that says:
Disallow: /images/
Adding reference to your sitemap.xml file
If you have a sitemap.xml file (and you should, as it improves your SEO rankings), it is a good idea to include the following line in your robots.txt file:
Sitemap: http://www.domain.com/sitemap.xml
(This line needs to be updated with your domain name and sitemap file).
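For example, a minimal robots.txt that allows full crawling and points crawlers to the sitemap (using the placeholder domain above) would look like this:
User-agent: *
Disallow:

Sitemap: http://www.domain.com/sitemap.xml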
Miscellaneous remarks
WordPress creates a virtual robots.txt file once you publish your first post with WordPress. If you already have a real robots.txt file on your server, though, WordPress won’t add a virtual one.
The virtual robots.txt doesn’t exist on the server; you can only access it via a link such as: http://www.yoursite.com/robots.txt
By default it will have Google’s Mediabot allowed, a bunch of spam-bots disallowed, and some standard WordPress folders and files disallowed.
So if you haven’t created a real robots.txt yet, create one with any text editor and upload it to the root directory of your server via FTP.
Blocking main WordPress Directories
Every WordPress installation contains three standard directories that don’t need to be indexed: wp-content, wp-admin and wp-includes.
Don’t disallow the whole wp-content folder, though, as it contains an ‘uploads’ subfolder with your site’s media files that you don’t want blocked. That’s why you need to proceed as follows (a complete example is shown after the list below):
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
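Putting it all together, a typical WordPress robots.txt based on the rules above (with the sitemap reference from earlier, using your own domain instead of the placeholder) might look like this:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/

Sitemap: http://www.domain.com/sitemap.xml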
Blocking on the basis of your site structure
Every blog can be structured in various ways:
a) On the basis of categories
b) On the basis of tags
c) On the basis of both, or neither
d) On the basis of date-based archives
a) If your site is category-structured, you don’t need to have the tag archives indexed. Find your tag base on the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply ‘tag’:
Disallow: /tag/
b) If your site is tag-structured, you need to block the category archives. Find your category base and use the following directive:
Disallow: /category/
c) If you use both categories and tags, you don’t need to add any directives. If you use neither of them, you need to block both:
Disallow: /tag/
Disallow: /category/
d) If your site is structured on the basis of date-based archives, you can block those as follows:
Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/
NOTE: you can’t use Disallow: /20*/ here, as such a directive would also block every single blog post or page whose URL starts with the number ‘20’.
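As an illustration, a category-structured blog that also wants to keep its date-based archives out of the index could combine the rules from a) and d) above (assuming the default ‘tag’ base and the years shown in the example):
User-agent: *
Disallow: /tag/
Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/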
Duplicate content issues in WordPress
By default WordPress produces duplicate pages, which does your SEO rankings no good. To fix this, we would advise you not to use robots.txt, but to go with a subtler method: the ‘rel=canonical’ tag, which you use to place the single correct canonical URL in the <head> section of your pages. This way web crawlers will only crawl the canonical version of a page.
That’s it!