Managing web robots helps web page authors, site managers, and programmers make their pages more accessible to the web community. The goal is to make pages more readily identifiable to search engines by tuning header information to match how the search engines perform their indexing.
Webmasters may be surprised to find that their site has been indexed by an indexing robot and that the robot has provided links (through a search engine) deep into the site. More surprising, and unpleasantly so, is when the robot has visited a sensitive part of the site it should not have been permitted to reach, or when all entry to the site was meant to flow through a single portal-like home page. The /robots.txt protocol is one way of dealing with this issue.
(Yes, there are specific ways to ensure site security, and no, the /robots.txt protocol is not intended for security, but there can be a middle ground of an open site that is not indexed. If there are files on your web site that you don't want unauthorized people to access, use some type of authentication: either server-based authorization or, for the really serious, Secure Sockets Layer [SSL].)
Many Web robots offer facilities for web site administrators and content providers to limit what the robot does. Typically, control of these robots is achieved through two mechanisms: a "robots.txt" file and the meta element in individual html documents. Both of these are described below.
Generally, search engines make this information available within a few links of their home pages. For example, Google places it under Frequently Asked Questions, under Information for Webmasters, under Jobs, Press, & Help, off the Google home page.
For more complete information than is presented here, visit the Web Robots Pages.
When a robot visits a web site, say http://www.docsteve.com/, it first checks for http://www.docsteve.com/robots.txt. If it finds this document, it analyzes its contents to determine which documents it is allowed to retrieve. The robots.txt file may be customized to apply only to specific robots and to disallow access only to specific directories or files.
What the robot looks for is a "/robots.txt" URI on the site, where a site is defined as an HTTP server running on a particular host and port number. Here are some sample URIs for robots.txt:
URI for web site | URI for robots.txt
---|---
http://www.docsteve.com/ | http://www.docsteve.com/robots.txt
http://www.docsteve.com:80/ | http://www.docsteve.com:80/robots.txt
http://www.docsteve.com:8080/ | http://www.docsteve.com:8080/robots.txt
http://docsteve.com/ | http://docsteve.com/robots.txt
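As an illustration (not part of the original protocol text), the mapping in the table above can be sketched with Python's standard urllib.parse module: keep the scheme and the host:port, and replace whatever path was requested with "/robots.txt".

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_uri(site_uri):
    """Derive the /robots.txt URI for the site hosting the given URI."""
    parts = urlsplit(site_uri)
    # Keep scheme and host:port; replace the path with /robots.txt
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_uri("http://www.docsteve.com:8080/"))
# http://www.docsteve.com:8080/robots.txt
```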
The robot will always look in the web server's document root directory for the robots.txt file; that is, the robots.txt file goes in the same directory that typically holds the index.html home page. If, for example, the actual path within the web server's directory structure from the filesystem root to the web files root is
/usr/local/web_services/docsteve/www/
and the web directory structure begins in www, so that the home page, index.html, with the URI "http://www.docsteve.com/index.html", is the physical file
/usr/local/web_services/docsteve/www/index.html
then the robots.txt file would be the physical file
/usr/local/web_services/docsteve/www/robots.txt
Here is a sample robots.txt file that prevents all robots from visiting the entire site:
User-agent: *    # applies to all robots
Disallow: /      # disallow indexing of all pages
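By way of illustration (Python is not part of the original protocol), the standard library's urllib.robotparser module implements exactly this check. The sketch below feeds it the sample rules above and asks whether a hypothetical robot may fetch a page:

```python
from urllib import robotparser

# The sample rules above, one directive per line.
rules = [
    "User-agent: *",   # applies to all robots
    "Disallow: /",     # disallow indexing of all pages
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# With the whole site disallowed, no URL may be fetched.
print(rp.can_fetch("AnyBot/1.0", "http://www.docsteve.com/index.html"))
# False
```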
There can be only a single "/robots.txt" file on a site. Specifically, you should not put "robots.txt" files in user or sub directories, because a robot will never look at them. If you want different sections of the site to be indexed differently (or if different users want to be able to set their own rules), all of that information must be merged into the single "/robots.txt" in the root. An alternative is to use the robots meta element instead (see below).
There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field: a case-insensitive substring match of the name, without version information, is recommended. If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records; only one such default record is allowed in the "/robots.txt" file.
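The recommended matching rule can be sketched in a few lines of Python. This is a hypothetical helper, not code from any robot; it strips the version information (everything after the first "/") and does a case-insensitive substring comparison, with "*" matching everything:

```python
def record_matches(field_value, robot_name):
    """Does a User-agent field value apply to the named robot?

    field_value -- the value of a "User-agent:" field in /robots.txt
    robot_name  -- the robot's own name, possibly with a version,
                   e.g. "WebCrawler/3.0"
    """
    if field_value == "*":
        return True  # default record: matches any robot
    # Strip version information and compare case-insensitively.
    bare_name = robot_name.split("/")[0].strip().lower()
    return field_value.lower() in bare_name

print(record_matches("webcrawler", "WebCrawler/3.0"))  # True
print(record_matches("googlebot", "WebCrawler/3.0"))   # False
```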
The "Disallow" field specifies a partial URI that is not to be visited. This can be a full path, or a partial path; any URI that starts with this value will not be retrieved. For example,
Disallow: /help
disallows both /help.html and /help/index.html, whereas
Disallow: /help/
would disallow /help/index.html but allow /help.html.
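This prefix-matching behavior can be verified with Python's urllib.robotparser (an illustration added here, not part of the original text); the helper below builds a one-rule robots.txt and asks whether a given path may be fetched:

```python
from urllib import robotparser

def allowed(disallow_value, path):
    """May a default robot fetch `path` under a single Disallow rule?"""
    rp = robotparser.RobotFileParser()
    rp.parse(["User-agent: *", "Disallow: " + disallow_value])
    # The host name here is arbitrary; only the path is matched.
    return rp.can_fetch("AnyBot", "http://www.docsteve.com" + path)

print(allowed("/help", "/help.html"))         # False: "/help" is a prefix
print(allowed("/help", "/help/index.html"))   # False: also a prefix
print(allowed("/help/", "/help.html"))        # True: "/help/" is not a prefix
print(allowed("/help/", "/help/index.html"))  # False
```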
Some additional Disallow usage notes include,
The meta element allows html authors to tell visiting robots whether an individual document may be indexed or used to harvest more links. No server administrator action is required.
In the following example a robot should neither index this document, nor analyze it for links.
<meta name="robots" content="noindex, nofollow" />
The recognized terms for the content attribute are all, index, follow, noindex, and nofollow. Both the name and the content attribute values are case-insensitive.
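To illustrate how a robot might honor this element (a sketch, not any particular robot's implementation), the following uses Python's standard html.parser module to collect the directives from any robots meta elements in a document:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing XHTML tags (<meta ... />) also arrive here,
        # via HTMLParser's default handle_startendtag.
        attr = dict(attrs)
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives |= {t.strip().lower() for t in content.split(",")}

p = RobotsMetaParser()
p.feed('<html><head><meta name="robots" content="noindex, nofollow" /></head></html>')
print(sorted(p.directives))
# ['nofollow', 'noindex']
```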
While some indexing engines read entire pages and pull out nouns and verbs, others look for meta elements. These elements generally present either a comma-separated list of keywords/phrases or a short description, which search engines may then present as the result of a search. The value of the name attribute sought by a search engine (keywords, description, or even something else) is not defined in any formal specification, but consider these examples:
<meta name="keywords" content="Selkirk,railroads,yards,CSX,Conrail,New York Central" />
<meta name="description" content="Railfan's guide to CSX's Selkirk Yard" />
In the global context of the Web, it is important to know in which human language a page was written (as opposed to which markup or scripting language, such as HTML 3.2, HTML 4.01, XHTML 1.1, or JavaScript 1.5). Use the "lang" attribute on the html element; examples for HTML and XHTML follow:
<html lang="en-us">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
If you have prepared translations of a document into other languages, you should use the link element to reference them. This allows an indexing engine to offer users search results in their preferred language, regardless of the language in which the query was written. For instance, the following links offer French and German alternatives to a search engine (the elements are in XHTML format: lowercase, with attributes quoted and internally closed):
<link rel="alternate" type="text/html" href="se_info-fr.html" hreflang="fr" lang="fr" title="Chemin de Fer de Selkirk" />
<link rel="alternate" type="text/html" href="se_info-de.html" hreflang="de" lang="de" title="Eisenbahnen von Selkirk" />
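As a sketch of how an indexing engine might consume these elements (again using Python's standard html.parser, and not taken from any real engine), the parser below collects the language code, target, and title of each alternate-language link:

```python
from html.parser import HTMLParser

class AlternateLinkParser(HTMLParser):
    """Collect (hreflang, href, title) from rel="alternate" links."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and attr.get("rel", "").lower() == "alternate":
            self.alternates.append(
                (attr.get("hreflang"), attr.get("href"), attr.get("title"))
            )

p = AlternateLinkParser()
p.feed('<link rel="alternate" type="text/html" href="se_info-fr.html" '
       'hreflang="fr" lang="fr" title="Chemin de Fer de Selkirk" />')
print(p.alternates)
# [('fr', 'se_info-fr.html', 'Chemin de Fer de Selkirk')]
```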
Collections of word processing documents or presentations are frequently translated into collections of HTML documents. It is helpful for search results to reference the beginning of the collection in addition to the page hit by the search. You can help search engines by using the link element with a rel="start" or rel="index" attribute along with the title attribute, as in the following examples:
<link rel="start" type="text/html" href="se_info.htm" title="Selkirk: General Information" />
<link rel="index" href="index.html" />