robots.txt history

Web robots, also known as web crawlers or web spiders, are programs that search engines and similar applications use to roam the web automatically and index its content. In the early 1990s, robots sometimes visited servers where they weren’t welcome, causing a number of problems:

•    swamping servers with requests, reducing performance
•    retrieving the same files repeatedly
•    indexing duplicated content or temporary information
•    scanning for email addresses to send spam

These incidents highlighted the need for a way to restrict what robots could index.

This led to the creation of a file on your server that tells robots where they can’t go. The file has to be accessible via HTTP at the local URL “/robots.txt”. This is easy to implement, and a robot can tell where it’s not allowed to go by retrieving a single document.
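To make this concrete, here is a minimal sketch of how a well-behaved robot might consult such a file, using Python's standard `urllib.robotparser` module. The file contents, the robot name "ExampleBot", and the www.example.com URLs are made up for illustration; a real robot would first download the file from “/robots.txt” on the server it wants to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tmp/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical robot name.
home_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
tmp_allowed = parser.can_fetch("ExampleBot", "http://www.example.com/tmp/cache.html")

print(home_allowed)  # the home page is not covered by any Disallow rule
print(tmp_allowed)   # /tmp/ is disallowed for every robot
```

Note that the file is only a request: it works because cooperative robots check it before fetching a page, not because the server enforces it.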

robots.txt checker

To ensure there are no errors in your robots.txt file, instead of wading through it line by line, you can use this handy robots.txt checker tool from Motoricerca, an Italian non-profit SEO site.

In their own words: ‘This robots.txt checker is a “validator” that analyzes the syntax of a robots.txt file to see if its format is valid as established by Robot Exclusion Standard (please read the documentation and the tutorial to learn the basics) or if it contains errors. The validation process takes in account both Robots Exclusion Standard rules and spider-specific (Google, Inktomi, etc.) extensions (including the new “Sitemap” command).’

This is a neat little tool that will save you some time checking your robots.txt file. If it does uncover errors, that could mean your site is not being indexed correctly, which could be affecting your presence on the web. Unless you’re comfortable with robots.txt syntax, I wouldn’t recommend fiddling with the file yourself; rather, get in touch with a professional.
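If you're curious what a syntax check of this kind involves, the sketch below shows the general idea. It is not how the Motoricerca tool works internally. The function name, the list of accepted fields, and the sample file are all assumptions chosen for illustration, and a real validator checks far more than this.

```python
# Fields this sketch accepts. "sitemap" and "crawl-delay" are spider-specific
# extensions rather than part of the original standard; the list is an assumption.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def check_robots_txt(text: str) -> list:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank and comment-only lines are fine
        field, separator, _value = line.partition(":")
        field = field.strip().lower()
        if not separator:
            problems.append(f"line {number}: missing ':' separator")
        elif field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow") and not seen_user_agent:
            problems.append(f"line {number}: '{field}' before any User-agent line")
    return problems


# Hypothetical file with a misspelled field and a missing colon.
sample = "User-agent: *\nDisalow: /tmp/\nDisallow /private/\n"
for problem in check_robots_txt(sample):
    print(problem)
```

A check like this only catches formatting slips. It can't tell you whether the rules actually block or allow the pages you intended.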

