
An SEO’s Guide To Robots.txt Files

Every SEO should know their way around the core principles of a robots.txt file. It is the first thing a crawler looks for when it hits a subdomain, so getting the basics (and the not-so-basics) spot on is important to ensure you don’t end up with pages appearing poorly in search results or dropping out of them altogether.

Robots.txt Location

Your robots.txt file must sit at the root of your subdomain. No negotiation here. What actually happens is that the crawler strips the path from the URL – everything after the first forward slash – and requests /robots.txt from what is left, but in practical terms this means your robots.txt should sit at the root.

Put it anywhere else, and crawlers won’t find it, which means you effectively have no robots.txt file on your site. That means, incidentally, that bots will assume they can access everything and so will just go berserk and crawl every inch of the site they can get to. This might be perfectly fine if you have a smaller website – but it can be very risky SEO-wise on a large catalogue or enterprise site where you want to more carefully control crawler behaviour to make sure things are indexed to best effect.
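As a quick illustration (using the same placeholder domain as the examples below, with a hypothetical shop. subdomain), a crawler will only ever request the file from the root of whichever subdomain it is on:

http://www.website.com/robots.txt – found and obeyed
http://www.website.com/files/robots.txt – ignored; to a crawler this is just another URL
http://shop.website.com/robots.txt – each subdomain needs its own file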

Robots.txt Basics

You can create a robots.txt file in any plain text editor – even Notepad will do. A very basic robots.txt file will look something like this:

User-agent: *
Disallow:
Sitemap: http://www.website.com/sitemap.xml

The first line uses the wildcard * to mean “any user agent” (so “any robot”), the blank Disallow line means nothing on the site is disallowed from being crawled, and the Sitemap line specifies the location of the XML sitemap (or sitemap index file) for the website so the bot can hop onto it and start crawling from that list. Keeps things nice and efficient!
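If you ever need to give one bot different instructions from the rest, you can address it by name in its own block. For example (purely illustrative), you could ask Googlebot to stay out of a /test/ folder while leaving the site fully open to every other crawler:

User-agent: Googlebot
Disallow: /test/

User-agent: *
Disallow:

Sitemap: http://www.website.com/sitemap.xml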

If you want to stop all bots from crawling content within certain folders – say, an area only accessible to logged-in users – that’s pretty simple to do.

User-agent: *
Disallow: /user-area/
Sitemap: http://www.website.com/sitemap.xml

You can also keep robots away from a single page or file if you want.

User-agent: *
Disallow: /user-area/
Disallow: /assets/media/invoice-template.pdf
Sitemap: http://www.website.com/sitemap.xml
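Most major search engines also understand simple pattern matching with * (any sequence of characters) and $ (end of URL) – these are extensions supported by the big crawlers rather than part of the original robots.txt standard. So rather than listing every PDF individually, you could, for example, block them all in one line:

User-agent: *
Disallow: /*.pdf$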

Important Notes On Robots.txt Blocking

It is important to note that blocking things in robots.txt does not prevent them from appearing in search engine results pages altogether. What you may end up seeing in a SERP is a bare listing for the blocked URL – typically just the URL or its title with a note along the lines of “No information is available for this page” in place of a description, because the search engine knows the URL exists but isn’t allowed to crawl its content.

For most things this may actually be fine. With user areas, invoice templates and so forth, you’re probably not too worried about the outlier cases where they show up like this, as long as their full content isn’t being indexed and ranked organically.

In some cases, however, brands may be more sensitive to certain URLs or files and want to ensure they will never show up in a search engine in any shape or form. If this is the case, it is vitally important to ensure that these files are not blocked in robots.txt – the bot will need to crawl the asset thoroughly, not just “ping the URL,” so it can see the robots meta noindex tag or x-robots noindex HTTP header.
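For reference, those two signals look something like the snippets below – a robots meta tag placed in the <head> of an HTML page, or an X-Robots-Tag header sent in the HTTP response for non-HTML assets such as PDFs (exactly how you set the header will depend on your server):

<meta name="robots" content="noindex">
X-Robots-Tag: noindex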

It is also critical not to block assets in robots.txt that are needed to render pages in a browser – things like JavaScript and CSS files. In the past many developers would mass-block script or CSS folders, but doing this now will result in a grumpy message from Google in Search Console and can have a direct negative impact on your organic visibility (Google announced this change in 2014).
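The sort of legacy blocking to look out for (and remove) looks something like this – the folder names here are just illustrative, yours may differ:

User-agent: *
Disallow: /js/
Disallow: /css/
Disallow: /assets/scripts/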

Other Important Notes

There are plenty of other things worth knowing about robots.txt files. Keep the following in mind:

Remember that robots.txt rules are directives, not enforceable controls. Naughty and malicious bots and crawlers will generally ignore the file altogether and take whatever they want from your site. Be aware too that the robots.txt file is always public – anyone can see it by going to the /robots.txt URL on your site! Definitely don’t rely on the robots.txt file to keep secure areas of your site “hidden” or “safe” – use appropriate encryption and login protocols instead.