What is robots.txt?

Updated on May 5, 2019

As you start digging into the technical side of SEO, you may hear about robots.txt. What is robots.txt? It is a plain text file containing a set of instructions that guide search engine spiders while they crawl your website.

This is part of the Robots Exclusion Protocol, a standard websites use to communicate with web crawlers.

The robots.txt file is located in the root directory of your website. It must be named robots.txt, in lowercase, and there can be only one per host. A subdomain such as blog.example.com needs its own file.

How to find out if your website has a robots.txt file

To find out if your website has a robots.txt file (and view it), simply add /robots.txt to the end of your domain in a web browser.

So, if your website URL is https://example.com, open a browser such as Google Chrome and type in https://example.com/robots.txt.

If the file exists on your website, it will load and you will be able to read it directly in the web browser.
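
If you would rather script this check, here is a minimal Python sketch of the same idea. The helper name robots_url is our own invention for illustration; it only builds the address and does not download anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(site_url: str) -> str:
    """Return the robots.txt URL for the host that serves site_url."""
    parts = urlsplit(site_url)
    # robots.txt always sits at the root of the host,
    # no matter which page of the site we start from.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com"))                     # https://example.com/robots.txt
print(robots_url("https://example.com/blog/some-post/"))     # https://example.com/robots.txt
```

Passing the result to something like urllib.request.urlopen would then fetch the file, assuming the site actually has one.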

How to make a robots.txt file

To make a robots.txt file you can use any plain text editor.

Do not use a program such as Microsoft Word or Google Docs. Those programs will try to auto-correct your text. They also convert plain characters into formatted versions, such as curly quotes, that will not translate properly when uploaded to your website.

Use Notepad on a Windows computer or TextEdit on a Mac (switch TextEdit to plain text mode first). If you’d like the editor to look a little prettier, you can download a text editor such as Sublime Text.

Before we show the exact code you’ll want to use, let’s discuss the common pieces of the robots.txt file.

Components of a robots.txt file

User-agent

This specifies the web crawler that the rules beneath it apply to. If you use an asterisk (*), the rules apply to all crawlers.

The Web Robots Database has a large list of the various user agents. For Google-specific user agents, Google provides a full list of crawlers.
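
To see how user agents pick their rules, here is a small sketch using Python's built-in urllib.robotparser module. The /drafts/ directory and the rule set are made up for illustration, and Python's parser is only an approximation of how any particular search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot, one for every other crawler.
RULES = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot matches its own group, so /drafts/ is off limits to it.
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.com/drafts/post.html")
# Bingbot has no group of its own, so it falls back to the * group.
bingbot_allowed = parser.can_fetch("Bingbot", "https://example.com/drafts/post.html")

print(googlebot_allowed)  # False
print(bingbot_allowed)    # True
```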

Disallow

This specifies the URL path or directory you would like to block from web crawlers. You could use this to block logged-in administrative pages or thank-you pages that should only be displayed after a successful form submission.

It’s important to note that disallowing crawlers from parts of your website is meant to optimize how the spiders spend their crawl time. It helps guide them to the pages that matter.

It’s no guarantee that those pages won’t appear in search engine results pages. If you’re looking to exclude pages from search results, look into adding a noindex meta tag, such as <meta name="robots" content="noindex">, to the page itself. Keep in mind that a crawler can only see that tag if the page is not disallowed in robots.txt.

Noindex

Some sites add a noindex line to robots.txt in the hope that the page will not be indexed by search engines. It’s common to see a page listed under both a disallow and a noindex rule in an attempt to remove it from search engines.

But Google has stated they do not recognize the noindex directive in robots.txt and that it should instead be included as a meta tag on the page. A page that is only disallowed can still appear in search results if it is linked to from somewhere else.
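
Since the noindex instruction belongs in the page rather than in robots.txt, it can help to confirm that a page really carries it. The following is a rough sketch using Python's built-in html.parser; the class name, the helper name, and the sample page are all our own assumptions, and it only looks at the robots meta tag.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # content is a comma-separated list such as "noindex, follow".
            self.directives.update(part.strip().lower() for part in content.split(","))


def has_noindex(html: str) -> bool:
    """Return True if the page's robots meta tag includes noindex."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical thank-you page that asks search engines not to index it.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body>Thanks!</body></html>'
print(has_noindex(PAGE))  # True
```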

Allow

If you block a directory, you can use this to allow access to a specific file within it.
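
Here is a small sketch of an allow rule carving one file out of a blocked directory, again using Python's urllib.robotparser. The /private/ directory and press-kit.pdf file are hypothetical. One caveat: Python's parser uses the first rule that matches, so the Allow line is placed before the Disallow line here, while Google picks the most specific matching rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block a directory but allow one file inside it.
RULES = """\
User-agent: *
Allow: /private/press-kit.pdf
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

file_allowed = parser.can_fetch("*", "https://example.com/private/press-kit.pdf")
rest_allowed = parser.can_fetch("*", "https://example.com/private/notes.html")

print(file_allowed)  # True
print(rest_allowed)  # False
```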

Sitemap

You can specify the location(s) of your sitemap file(s). Each location should be given as a full URL.
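
As a quick illustration, the sketch below lists two sitemaps in a robots.txt file and reads them back with Python's urllib.robotparser. The sitemap URLs are placeholders, and the site_maps() method needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: allow everything and point crawlers at two sitemaps.
RULES = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns the listed URLs in order, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```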

Now, let’s go over some of the common scenarios you’d want to use in your robots.txt file.

The most common configurations for robots.txt files

Block Nothing

This is the most common (and safest) setup for your robots.txt file. Allow all web crawlers to view everything on your website.

This is what the robots.txt file would look like:

User-agent: *
Disallow:

This says that, for all web crawlers, nothing is disallowed.

Block Everything

If your robots.txt file blocks everything, this is most likely a mistake. If you truly want to keep the whole site out of search results, you’ll want to password protect the website instead.

This is what the robots.txt file looks like when everything is blocked:

User-agent: *
Disallow: /

You can see it’s very similar to the code you’d provide to block nothing. But instead of leaving disallow blank, it now includes a /, which indicates that every page after the root domain is disallowed.
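
Because the two files differ by a single character, it is worth checking which one you actually have. This sketch compares them with Python's urllib.robotparser; the allowed helper and the /about/ URL are our own illustrative choices.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules: str, url: str, agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_NOTHING = "User-agent: *\nDisallow:\n"
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"

print(allowed(BLOCK_NOTHING, "https://example.com/about/"))     # True
print(allowed(BLOCK_EVERYTHING, "https://example.com/about/"))  # False
```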

Block Directory

You can use robots.txt to block a specific directory. This is useful to block administration pages or pages that require the user to be logged in.

If you use WordPress you’ll want to block the wp-admin directory.

Here is what the robots.txt file looks like when you block a directory:

User-agent: *
Disallow: /wp-admin/

This tells all web crawlers to ignore the /wp-admin/ directory. Since all of those pages are password protected, crawlers have no use for them, and skipping them helps optimize which pages are crawled. It guides the crawler toward the pages you want indexed and ranking.
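
A directory rule matches anything whose path starts with the text you disallow, so the trailing slash matters. The sketch below checks a few paths with Python's urllib.robotparser; options.php and /blog/ are example paths we picked, not anything your site is guaranteed to have.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /wp-admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

results = {
    path: parser.can_fetch("*", "https://example.com" + path)
    for path in ["/wp-admin/", "/wp-admin/options.php", "/wp-admin", "/blog/"]
}

# Everything under /wp-admin/ is blocked. The bare /wp-admin path (no
# trailing slash) does not start with "/wp-admin/", so it is not matched.
for path, is_allowed in results.items():
    print(path, is_allowed)
```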

Block Page

You can also set up your robots.txt file to disallow web crawlers on a particular page.

This can be used to deter spiders from accessing pages that should be reached only after an action has been performed. For example, a thank-you page after a payment has been processed.

The configuration is very similar to blocking a directory. You enter the page’s path (the part of the URL after the domain) in the disallow directive. It would look like this:

User-agent: *
Disallow: /thank-you.html

This would tell all search engine spiders to ignore the page located at /thank-you.html.
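
Page rules are matched by prefix as well, which means the same line also covers the page when it is loaded with a query string. Here is a short sketch with Python's urllib.robotparser; the order parameter and /contact.html are made up for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /thank-you.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

page_allowed = parser.can_fetch("*", "https://example.com/thank-you.html")
# The query-string version still starts with /thank-you.html, so it is blocked too.
query_allowed = parser.can_fetch("*", "https://example.com/thank-you.html?order=123")
other_allowed = parser.can_fetch("*", "https://example.com/contact.html")

print(page_allowed)   # False
print(query_allowed)  # False
print(other_allowed)  # True
```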

The robots.txt file can be ignored

The robots.txt file will help search engine spiders crawl your website, but only if they choose to read it.

This file will do nothing to deter spam bots from crawling your website. Spam bots will ignore the file completely, so even if you have it set to block everything, they can still crawl your site.
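
To make the point concrete, the check against robots.txt happens inside the crawler, not on your server. The sketch below shows what a well-behaved bot might do before requesting pages; polite_urls and the PoliteBot name are hypothetical. A spam bot simply skips this step and requests every URL anyway.

```python
from urllib.robotparser import RobotFileParser


def polite_urls(rules: str, agent: str, urls: list) -> list:
    """Return only the URLs a rule-following crawler would request."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    # Nothing forces a crawler to run this filter; it is purely voluntary.
    return [url for url in urls if parser.can_fetch(agent, url)]


RULES = "User-agent: *\nDisallow: /\n"
URLS = ["https://example.com/", "https://example.com/pricing/"]

print(polite_urls(RULES, "PoliteBot", URLS))  # []
```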

The data in robots.txt is publicly available

Anyone can view your robots.txt file. Don’t use it to try to hide private data. Listing a private URL in the file actually tells everyone where it is.

The same easy method you use to access your robots.txt file can be used by anyone. Since the file always has to be in the same location and use the same name, anyone can view it.

Of course, this also means you can view anyone else’s as well if you are curious what rules they’ve applied.

Test your robots.txt file

Once you have a robots.txt file in place, remember to test it. You can do this in Google Search Console. At the time of this update, the tester is only available in the old version of Search Console, with a move to the new version planned.

To access the robots.txt file tester:

  1. Navigate to Google Search Console.
  2. If using the new version, scroll down to the bottom of the left menu and click Go to old version.
  3. Navigate to Crawl > robots.txt Tester.

This will show you your robots.txt file and flag any errors or warnings.
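
You can also run a quick sanity check on your own computer before uploading the file. The sketch below uses Python's urllib.robotparser to compare a list of URLs against what you expect; check_rules and the sample expectations are our own illustration. Python's parser may differ from Google's in edge cases, so treat it as a first pass rather than a replacement for the tester.

```python
from urllib.robotparser import RobotFileParser


def check_rules(rules: str, expectations: dict, agent: str = "*") -> list:
    """Return (url, expected, actual) for every URL the rules get wrong."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    failures = []
    for url, expected in expectations.items():
        actual = parser.can_fetch(agent, url)
        if actual != expected:
            failures.append((url, expected, actual))
    return failures


RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /thank-you.html
"""

# True means the URL should be crawlable, False means it should be blocked.
EXPECTED = {
    "https://example.com/": True,
    "https://example.com/wp-admin/": False,
    "https://example.com/thank-you.html": False,
}

print(check_rules(RULES, EXPECTED))  # [] means every URL behaved as expected
```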

Do you have any robots.txt tips you’d like to share? Let us know in the comments!

About the Author

Jennifer Rogina has been a digital marketing specialist since 2008. During those years she has focused on Pay Per Click Advertising, Search Engine Optimization, and Conversion Rate Optimization.
