A key part of your on-site optimisation is the robots.txt file. It is often overlooked, but it is essential to the relationship search engines have with your site. The file may be no more than a few bytes, yet it is well worth including in your optimisation strategy. The robots.txt file is usually found in your site’s root directory, and its purpose is to regulate the bots that crawl your site. It’s here that you grant or deny permission to all, or some specific, search engine robots to access certain pages or your site as a whole. The convention behind it, developed in 1994, is known as the Robots Exclusion Standard (or Robots Exclusion Protocol).
More info here: http://www.robotstxt.org/
The rules of the Robots Exclusion Standard are loose and there is no official body that governs them. The commonly used elements are listed below:
User-agent: This refers to the specific bots the rules apply to
Disallow: This refers to the site areas that the bot specified by the user-agent is not supposed to crawl
Allow: Used instead of or in addition to the above, with the opposite meaning
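Put together, these elements might look like the following sketch. The paths are hypothetical and chosen only for illustration:

```text
# Applies to every bot; both paths are hypothetical
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
```

Here all bots are asked to stay out of the /private/ folder, with one page inside it explicitly opened up again.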
The robots.txt file often also mentions the location of the sitemap. Whilst most existing search bots, including those belonging to the main search engines, understand the above elements, not all stick to the rules! As with everything, there are also cases that fall outside of this. Google, for example, states:
“While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.”
This means that Google can still index the URLs of pages even if those pages are blocked in the robots.txt file. Below is the link to Google support and a guide, as well as a further how-to section.
To set access rules for a specific robot, e.g. Googlebot, the user-agent needs to be defined accordingly:
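A minimal sketch of such a rule, using Googlebot and an /images/ folder as the illustrative target:

```text
# Only Googlebot is addressed; other bots are unaffected by this group
User-agent: Googlebot
Disallow: /images/
```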
In the above example, Googlebot is denied access to the /images/ folder of a site. Additionally, a specific rule can be set to explicitly disallow access to all files within a folder:
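A sketch of such a rule, assuming the same /images/ folder and the * wildcard that the major search engines support:

```text
# The trailing * matches any sequence of characters after /images/
User-agent: Googlebot
Disallow: /images/*
```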
The wildcard in this case refers to all files within the folder. But robots.txt can be even more flexible and define access rules for a specific page:
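For instance, with a hypothetical page path used purely for illustration:

```text
# Blocks a single page; the path is hypothetical
User-agent: Googlebot
Disallow: /directory/page.html
```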
– or a certain filetype:
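A sketch that assumes PDF files as the example file type. The $ character anchors the match to the end of the URL:

```text
# Blocks any URL ending in .pdf (the PDF file type is an assumed example)
User-agent: Googlebot
Disallow: /*.pdf$
```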
If a site uses parameters in URLs and they result in pages with duplicate content, you can opt out of having them crawled by using a corresponding rule, something like:
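A minimal sketch of a rule of this kind, relying on the * wildcard supported by the major search engines:

```text
# Matches any URL that contains a question mark
User-agent: *
Disallow: /*?
```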
The above means “do not crawl any URLs with a ‘?’ in them”, which is the way parameters are usually included in URLs.
With such an extensive set of commands, it’s easy to see how this can be tricky for website owners and webmasters alike, and how mistakes, which can be costly, are made.
Common robots.txt mistakes:
Some mistakes are easy to spot. These are listed below.
No robots.txt file at all
Having no robots.txt file for your site means it is completely open for any spider to crawl. If you have a simple static site with minimal pages and nothing you wish to hide, this may not be an issue, but it’s likely you are running a CMS.
No CMS is perfect, and the chances are that there are indexable instances of duplicate content, because the same articles are accessible via different URLs, as well as backend pages not intended for visitors to the site.
An empty robots.txt file can also be problematic. As well as the above issues, and depending on the CMS used on the site, both cases bear a risk of URLs like the example below getting indexed:
This can expose your site to potentially being indexed in the context of a bad neighborhood. (The actual domain name has of course been replaced, but the domain where this specific type of URL was indexable had an empty robots.txt file.)
Default robots.txt allowing access to everything
A robots.txt file that looks like the example below:
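A typical allow-all default, shown here as an illustrative assumption rather than a file taken from a specific site, pairs the wildcard user-agent with an empty Disallow:

```text
# An empty Disallow value blocks nothing
User-agent: *
Disallow:
```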
Or like this:
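Another allow-all variant (again illustrative, not taken from a specific site) uses an Allow rule for the whole site:

```text
# Explicitly allows every path for every bot
User-agent: *
Allow: /
```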
As in the two cases prior, you are leaving your site completely unprotected and there is little point in having a robots.txt file like this at all, unless, again, you are running a minimal static site and don’t want to hide anything on the server.
Best practice is not to mislead the search engines. If your sitemap.xml file contains URLs explicitly blocked by your robots.txt, this is a contradiction. It can often happen if your robots.txt and/or sitemap.xml files are generated by different automated tools and not checked manually afterward.
It’s easy to check for this using Google Webmaster Tools. To do so, you will need to have added your site to Google Webmaster Tools, verified it, and submitted an XML sitemap for it. You can then see a report on the crawling of the URLs submitted via the sitemap in the Optimization > Sitemaps section of Google Webmaster Tools.
Blocking access to sensitive areas with robots.txt
If there are areas of your site that need to be blocked, password protect them. DO NOT rely on robots.txt for this.
Remember that robots.txt is a recommendation, not a mandatory set of rules, which means that anyone not following the protocol, including rogue bots, could still access that area. The best rule of thumb is: if it needs to be 100% private, it’s best not to put it online! Members of the SEO community have been known to discover Google projects not yet released to the public by looking at Google’s robots.txt.
Whilst there is lots to consider with robots.txt, and it is a key part of your optimisation, we hope that this is useful when deciding on the best course of action. There is still some fun to be had with robots.txt, however, and for a more light-hearted look at it, the link below provides some great reading!