Every SEO should know their way around the core principles of a robots.txt file. It is the first thing a crawler looks for when it hits a subdomain, so getting the basics (and the not-so-basics) spot on is important: get it wrong and pages can appear poorly in search results or drop out of them altogether.
Your robots.txt file must sit at the root of your subdomain. No negotiation here. Strictly speaking, the crawler strips the path from the URL – everything after the first forward slash following the domain – and requests /robots.txt on that host, but in practical terms this means your robots.txt should sit at the root.
Put it anywhere else, and crawlers won’t find it, which means you effectively have no robots.txt file on your site. That means, incidentally, that bots will assume they can access everything and so will just go berserk and crawl every inch of the site they can get to. This might be perfectly fine if you have a smaller website – but it can be very risky SEO-wise on a large catalogue or enterprise site where you want to more carefully control crawler behaviour to make sure things are indexed to best effect.
You can create a robots.txt file in any basic text editor, up to and including Notepad. A very basic robots.txt file will look something like this:
User-agent: *
Disallow:
Sitemap: http://www.website.com/sitemap.xml
The first line uses a wildcard * to mean “any user agent” (so “any robot”), the blank disallow means nothing on the site is disallowed from crawling, and the sitemap line specifies the location of the XML sitemap (or sitemap index file) for the website so the bot can hop straight to it and start crawling from that list. Keeps things nice and efficient!
If you want to stop all bots from crawling content within certain folders – say, an area only accessible to logged-in users – that’s pretty simple to do.
User-agent: *
Disallow: /user-area/
Sitemap: http://www.website.com/sitemap.xml
You can also keep robots out from a single page or file if you want.
User-agent: *
Disallow: /user-area/
Disallow: /assets/media/invoice-template.pdf
Sitemap: http://www.website.com/sitemap.xml
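If you want to sanity-check rules like these before deploying them, Python’s standard-library robotparser will simulate how a compliant crawler reads the file. This is just a sketch using the example rules above (website.com and the paths are, of course, placeholders):

```python
from urllib import robotparser

# The example rules from above, as a crawler would fetch them
rules = """\
User-agent: *
Disallow: /user-area/
Disallow: /assets/media/invoice-template.pdf
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Anything under /user-area/ is blocked for every user agent...
print(rp.can_fetch("*", "http://www.website.com/user-area/profile"))  # False
# ...as is the single blocked file...
print(rp.can_fetch("*", "http://www.website.com/assets/media/invoice-template.pdf"))  # False
# ...but the rest of the site remains crawlable.
print(rp.can_fetch("*", "http://www.website.com/products/"))  # True
```

Note that robotparser follows the original exclusion standard, so it won’t honour the wildcard pattern-matching extensions discussed further down – it’s best used for checking simple prefix rules like these.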
Important Notes On Robots.txt Blocking
It is important to note that blocking things in robots.txt does not prevent them from appearing in search engine results pages altogether. What you may end up seeing in a SERP is a bare listing showing just the URL, with no proper title or description – Google typically displays a note along the lines of “No information is available for this page.”
Now for most things this may actually be fine. User areas or invoice templates and so forth – you’re probably not too worried about outlier cases where they show up like this, as long as their full content isn’t being indexed and ranked organically.
In some cases, however, brands may be more sensitive to certain URLs or files and want to ensure they will never show up in a search engine in any shape or form. If this is the case, it is vitally important to ensure that these files are not blocked in robots.txt – the bot will need to crawl the asset thoroughly, not just “ping the URL,” so it can see the robots meta noindex tag or x-robots noindex HTTP header.
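For reference, the two noindex mechanisms look something like this (a sketch – the exact implementation depends on your platform and server):

```
Option 1: robots meta tag, placed in the <head> of the page itself
<meta name="robots" content="noindex">

Option 2: HTTP response header, useful for PDFs and other non-HTML files
X-Robots-Tag: noindex
```

Either way, the crawler can only see these signals if robots.txt leaves the URL crawlable.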
It is also critical not to block assets in robots.txt that are needed to render pages in a browser. In the past many developers would mass block things like scripts or CSS folders, but doing this now will result in a grumpy message from Google in Search Console and can have a direct negative impact on your organic visibility levels (Google announced this change in 2014).
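The legacy pattern to avoid looks something like this (the folder names are illustrative only):

```
# Don't do this – Google needs these assets to render the page
User-agent: *
Disallow: /css/
Disallow: /js/
```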
Other Important Notes
There are plenty of other elements you might need to know about a robots.txt file. Keep an eye out for some of the following:
- Crawl delays. These were used back in the day to throttle robot access. There’s no reason to have them in a modern setup – and Google ignores the crawl-delay rule anyway.
- Pattern matching. Both Google and Bing robots will honour rules that make use of * (a wildcard, meaning “any sequence of characters”) and/or $ (which matches the end of a URL).
- The robots.txt file is case sensitive in every sense – the filename must be robots.txt (not robots.TXT, for example), and any rules you put in must match the case of the URLs they target.
- Only one URL rule can go per line. Three file or folder disallows, for example, must go on three lines.
- Processing order for rules is important! Google and Bing robots both apply the most specific (longest) matching rule first, while the standard processing order for other crawlers is top to bottom. If in doubt, put any Allows above any Disallows (for example, Allow a file in a directory before you Disallow the entire directory in order to achieve a “disallow everything in this directory except this file” effect).
- Avoid blocking files in robots.txt when you should be using other techniques. Some of the most common problems we see include blocking mobile websites from non-mobile bots or using robots.txt to block duplication caused by internal architecture problems. Make sure you address situations like this with search engine recommended solutions, not just by throwing robots.txt rules in!
- You can add comments (human but not machine-readable notes) to robots.txt files by using # at the beginning of a line.
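Several of the points above can be seen together in one short example (the folder and file names here are illustrative only):

```
# Comments start with a hash and are ignored by bots
User-agent: *
# Allow before Disallow: block everything in /downloads/ except one file
Allow: /downloads/brochure.pdf
Disallow: /downloads/
# * matches any sequence of characters: block all URLs with a query string
Disallow: /*?
# $ matches the end of a URL: block URLs ending in .xls
Disallow: /*.xls$
```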
Remember that while the robots.txt standard is a directive, it is not enforceable. Naughty and malicious bots and crawlers will generally ignore it altogether in favour of whatever they want from your site. Be aware too that the robots.txt file is always public – anyone can see it by going to the /robots.txt URL on your site! Definitely don’t rely on the robots.txt file to keep secure areas of your site “hidden” or “safe” – use the appropriate encryption and login protocols.