Robots.txt is a text file that website owners use to give web robots instructions about their site. It tells robots which parts of the site are open to crawling and which parts are closed. This is called the Robots Exclusion Protocol.
Google also offers a similar tool inside Google Webmaster Central, which shows Google crawling errors for your site.
Example Robots.txt Format
Allow indexing of everything
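A minimal sketch of this case under the standard Robots Exclusion Protocol syntax: an empty Disallow value blocks nothing, so every robot may crawl the whole site.

```text
User-agent: *
Disallow:
```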
Disallow indexing of everything
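The opposite case would typically look like this, where a single slash matches every URL on the site:

```text
User-agent: *
Disallow: /
```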
Disallow indexing of a specific folder
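For illustration, assuming the folder is named /folder/ (a hypothetical path, substitute your own):

```text
User-agent: *
# /folder/ is a placeholder path
Disallow: /folder/
```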
Disallow Googlebot from indexing a folder, except for one file in that folder
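One way to write this, assuming a hypothetical folder /folder1/ and file myfile.html. It relies on Googlebot honoring the Allow directive and preferring the more specific (longer) matching rule; other crawlers may treat Allow differently.

```text
User-agent: Googlebot
# Placeholder paths: block the folder, then re-open one file inside it
Disallow: /folder1/
Allow: /folder1/myfile.html
```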
Background Information on Robots.txt Files
- Robots.txt files tell search engine spiders how they may crawl and index your content.
- By default search engines are greedy. They want to index as much high quality information as they can, & will assume that they can crawl everything unless you tell them otherwise.
- When you block URLs from being indexed in Google via robots.txt, Google may still show those pages as URL-only listings in its search results. A better solution for completely blocking the indexing of a particular page is to use a robots noindex meta tag on a per-page basis. You can tell search engines not to index a page, or not to index a page and not to follow its outbound links, by inserting either of the following code bits in the HTML head of the document that you do not want indexed.
- <meta name="robots" content="noindex"> <-- the page is not indexed, but links may be followed
- <meta name="robots" content="noindex,nofollow"> <-- the page is not indexed & the links are not followed
- Please note that if you do both: block the search engines in robots.txt and via the meta tags, then the robots.txt command is the primary driver, as they may not crawl the page to see the meta tags, so the URL may still appear in the search results listed URL-only.
- If you do not have a robots.txt file, your server logs will record a 404 error whenever a bot tries to access your robots.txt file. You can upload a blank text file named robots.txt to the root of your site if you want to stop getting 404 errors but do not want to offer any specific commands for bots.
- Some search engines allow you to specify the address of an XML Sitemap in your robots.txt file, but if your site is small & well structured with a clean link structure you should not need to create an XML sitemap. For larger sites with multiple divisions, sites that generate massive amounts of content each day, and/or sites with rapidly rotating stock, XML sitemaps can be a helpful tool for helping to get important content indexed & monitoring relative performance of indexing depth by pagetype.
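As a sketch of the sitemap reference mentioned above: search engines that support it read a Sitemap line containing the full URL of the file. The address below is a placeholder, not a real sitemap.

```text
# Placeholder URL: point this at your own XML sitemap
Sitemap: https://www.example.com/sitemap.xml
```

The Sitemap line stands apart from any User-agent group, so it can sit anywhere in the file.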
Robots.txt Wildcard Matching
Google and Microsoft's Bing allow the use of wildcards in robots.txt files.
To block access to all URLs that include a question mark (?), you could use the following entry:
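A likely form of that entry, assuming the crawler supports the * wildcard as Google and Bing do:

```text
User-agent: *
# * matches any run of characters, so this matches any URL containing ?
Disallow: /*?
```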
You can use the $ character to match the end of the URL. For instance, to block any URLs that end with .asp, you could use the following entry:
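A sketch of such an entry, shown here for Googlebot only (the choice of user agent is an assumption; use * to address every crawler that understands wildcards):

```text
User-agent: Googlebot
# $ anchors the pattern to the end of the URL
Disallow: /*.asp$
```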
More background on wildcards is available from Google and Yahoo! Search.
URL Specific Tips
Part of creating a clean and effective robots.txt file is ensuring that your site structure and filenames are created based on sound strategy. What are some of my favorite tips?
- Avoid Dates in URLs: If you later want to filter out date-based archives, dates in the file paths of your regular content pages make it easy to accidentally filter out those regular URLs as well. There are numerous other reasons to avoid dates in URLs.
- End URLs With a Slash: If you want to block a short filename and it does not have a trailing slash (/) at the end of it, you could accidentally end up blocking other important pages, because the rule also matches every longer URL that starts with the same characters.
- Consider related URLs if you use Robots.txt wildcards: I accidentally cost myself over $10,000 in profit with one robots.txt error!
- Dynamic URL Rewriting: Yahoo! Search offers dynamic URL rewriting, but since most other search engines do not use it, you are probably better off rewriting your URLs in your .htaccess file rather than creating additional rewrites just for Yahoo! Search. Google offers parameter handling options & rel=canonical, but it is generally best to fix your public facing URLs in a way that keeps them as consistent as possible, such that
- if you ever migrate between platforms you do not have many stray links pointing into pages that no longer exist
- you do not end up developing a complex maze of gotchas as you change platforms over the years
- Sites across markets & languages: Search engines generally try to give known local results a ranking boost, though in some cases it can be hard to build links into many local versions of a site. Google offers hreflang to help them know which URLs are equivalents across languages & markets.
- More URL tips in the naming files section of our SEO training program.
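To illustrate the trailing slash tip above, here is a sketch using a hypothetical /seo path. Disallow rules are prefix matches, so the two alternatives behave quite differently; you would use one or the other, not both.

```text
User-agent: *
# Alternative A: no trailing slash. This prefix also matches
# /seo-tools/, /seo.html, and anything else that starts with /seo
Disallow: /seo

# Alternative B: trailing slash. This matches only URLs inside /seo/
Disallow: /seo/
```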
Sample Robot Oddities
Google Generating Search Pages on Your Site?
Google has begun entering search phrases into search forms, which may waste PageRank & has caused some duplicate content issues. If you do not have a lot of domain authority you may want to consider blocking Google from indexing your search page URL. If you are unsure of the URL of your search page, you can conduct a search on your site and see what URL appears. For instance,
- The default WordPress search URL is usually ?s=, so adding a rule that disallows /?s= to your robots.txt file would prevent Google from generating such pages
- Drupal powers the SEO Book site, and our default Drupal search URL is /search/node/
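Rules along these lines would cover the two search URLs mentioned above. This is a sketch: check your own site's search URL first, since themes and plugins can change the defaults.

```text
User-agent: *
# WordPress default search URL
Disallow: /?s=
# Drupal default search URL
Disallow: /search/node/
```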
Noindex instead of Disallow in Robots.txt?
Typically a noindex directive would be included in a meta robots tag. However, Google for many years unofficially supported using noindex inside robots.txt, similarly to how a webmaster would use disallow. Google stopped honoring it on September 1, 2019.
The catch, as noticed by Sugarrae, is that URLs which are already indexed but are then set to noindex in robots.txt will throw errors in Google's Search Console (formerly known as Google Webmaster Tools). Google's John Mueller also recommended against using noindex in robots.txt.
Secured Version of Your Site Getting Indexed?
In a guest post about 301 redirects and .htaccess, Tony Spencer offers tips on how to prevent the HTTPS (SSL) version of your site from getting indexed. In the years since that post was originally published, Google has indicated a preference for ranking the HTTPS version of a site over the HTTP version. There are ways to shoot yourself in the foot if the site is not redirected or canonicalized properly.
Have Canonicalization or Hijacking Issues?
Throughout the years some people have tried to hijack other sites using nefarious techniques with web proxies. Google, Yahoo! Search, Microsoft Live Search, and Ask all allow site owners to authenticate their bots.
- While I believe Google has fixed proxy hijacking right now, a good tip to minimize any hijacking risks is to use absolute links (like <a href="http://www.seobook.com/about.shtml">) rather than relative links (<a href="about.shtml">).
- If both the WWW and non WWW versions of your site are getting indexed you should 301 redirect the less authoritative version to the more important version.
- The version that should be redirected is the one that does not rank as well for most search queries and has fewer inbound links.
- Back up your old .htaccess file before changing it!
Answered 2 years ago by Gul Hafiz