4ps Marketing SEO Agency Strategic Digital Marketers

30 July 2010

Robots.txt – A Rough Guide

Right, since the title of this blog post doesn’t say much about what a robots.txt actually is, here’s the definition.

Robots.txt is a simple text file that is intended to hold access information for the web crawlers that visit your website. These crawlers are deployed by many search software vendors and well-known web search engines. For example, Google’s web crawler is named ‘Googlebot’ and Yahoo’s web crawler is named ‘Slurp’.

What is the importance of robots.txt?

One word: HIGH. Robots.txt is a very light file that sits in the root of your website’s hosting server. It tells bots which parts of your website they can and cannot access. This is the direct advantage. A bigger, indirect advantage is that it can help your website’s performance in terms of page load speed, and page load speed is an important factor in SEO.

How??

Well, when web crawlers visit your site, they use part of your hosting server’s bandwidth. Imagine driving down a road to a website, only you are a web crawler. That bandwidth is meant for delivering information to all your visitors, not only to web crawlers. If you have tons of web crawlers (which actually exist!) accessing your website, crawler traffic can take up a sizeable share of your bandwidth. Most web crawlers also make a habit of visiting frequently, which hogs the bandwidth further. Some do this to index your web pages for a legitimate search engine, which is good. The bad part is that many more crawlers are scanning for email addresses to spam, looking for places to post links on your content automatically or, at worst, trying to hack your website. By limiting access for the many unwanted crawlers, you can improve page loading speed, as there is more bandwidth left for transferring data. Imagine a freeway, only the cars are data packets!

Writing a robots.txt is very easy as well!
1. Open Notepad (or any other plain-text editor).
2. Define the access rights.
3. Save the file as robots.txt.
4. Upload the file to the root directory where your website is hosted.
5. You should now be able to see the file at www.yourwebsite.com/robots.txt

Here’s an example:
[Image: robots.txt example]

User-agent – name of the web crawler
Disallow – the folder or file you do not want to be crawled
* – all web crawlers
All files and folders are assumed to be accessible unless they are marked ‘Disallow’. You can also specify the location of your XML sitemap to help crawlers index your web pages.
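
To see how these directives behave, here is a minimal sketch using the robots.txt parser from Python’s standard library. The rules themselves are assumptions made up for illustration: the /admin/ folder, the private.html file, the crawler name ‘BadBot’ and the sitemap URL are not taken from any real site. The site_maps() call needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the paths, the 'BadBot' name and the sitemap URL
# are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private.html

User-agent: BadBot
Disallow: /

Sitemap: http://www.yourwebsite.com/sitemap.xml
"""


def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
site = "http://www.yourwebsite.com"

# Anything not disallowed is allowed.
print(parser.can_fetch("googlebot", site + "/index.html"))        # True
# Everything under /admin/ is blocked for all crawlers.
print(parser.can_fetch("googlebot", site + "/admin/login.html"))  # False
# BadBot has its own block that shuts it out of the whole site.
print(parser.can_fetch("BadBot", site + "/index.html"))           # False
# Sitemap locations listed in the file (Python 3.8+).
print(parser.site_maps())
```

This only shows how one parser reads the rules. Each crawler has its own implementation, so treat it as a quick way to sanity-check a file rather than a guarantee of how every bot will behave.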

Drawbacks of Robots.txt

• When a well-behaved crawler first visits, it takes the URL, strips the request URI and replaces it with ‘/robots.txt’ to check the access permissions. However, not all web crawlers do this or adhere to the access rights defined in robots.txt.
A more reliable way to deny access completely is to configure the .htaccess file on your server if it runs Apache, or to modify the settings in IIS on Windows Server.
• It is accessible by URL, so human visitors can easily find out which files you do not want crawled.
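
The lookup described in the first point (strip the request URI, replace it with ‘/robots.txt’) can be sketched in a few lines of Python. The function name robots_url and the example page URL are assumptions for illustration only. Real crawlers implement this step in their own way.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yourwebsite.com/blog/post.html?page=2"))
# http://www.yourwebsite.com/robots.txt
```

Because the file always lives at this predictable address, anyone can work it out and open it in a browser, which is exactly the second drawback above.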

Overall, it is definitely a benefit to have a robots.txt in place, especially when the cost is so low.


Written by Hannah Miller

As Digital Group Account Director, Hannah manages and trains a team of SEO experts within the agency. She is our technical lead for SEO and ensures we are on top of even the smallest changes in search engine algorithms.
