robots.txt and spidering.

when you have content that isn’t meant for public consumption, it’s better to be safe than sorry and keep the search engines from crawling (or spidering) the pages and learning your link structure. for example, in a development environment it’s hardly useful for a half-finished site to show up in search results as if it were the public version.
enter robots.txt. this file is extremely important; search engines request it before crawling a site and use it to decide which parts they may index and which you want kept private.
the basic robots.txt file works like this: you stick the file in the root of your website (e.g. the public_html or httpdocs folder). it won’t work if it’s located anywhere else, such as in a subdirectory of the site.
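once the file is in place, crawlers will look for it at the top level of the site, i.e. something like this (using www.yoursite.com as a stand-in for your own domain):

http://www.yoursite.com/robots.txt

you can open that URL in a browser (or fetch it with curl) to confirm the file is reachable; if it ends up somewhere like /private/robots.txt instead, the bots will never see it.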
the crux of robots.txt is the User-agent and Disallow directives. if you don’t want any search engine bots to spider any files on your site, the basic file looks like this:
User-agent: *
Disallow: /

however, if you don’t want the search engines to crawl a specific folder, e.g. www.yoursite.com/private, you would create the file like so:
User-agent: *
Disallow: /private/

if you don’t want google to spider a specific folder called /newsletters/, then you would use the following:
User-agent: googlebot
Disallow: /newsletters/

there are hundreds of bots that you’d need to consider, but the main ones are probably google (googlebot), yahoo (yahoo-slurp), and msn (msnbot).
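as a rough sketch, if you wanted to keep each of those three bots out of a hypothetical /drafts/ folder (the folder name is just for illustration), the file could simply repeat the same rule under each user-agent:

User-agent: googlebot
Disallow: /drafts/

User-agent: yahoo-slurp
Disallow: /drafts/

User-agent: msnbot
Disallow: /drafts/

note that any bot not named here (and not covered by a * group) is left free to crawl everything.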
you can also mix a catch-all group with bot-specific groups in a single robots.txt file. a bot obeys the most specific group that matches it, so in the example below googlebot only stays out of /cgi-bin/ and /private/, while every other bot is blocked from the whole site:
User-agent: *
Disallow: /

User-agent: googlebot
Disallow: /cgi-bin/
Disallow: /private/

there’s a great reference on user agents on wikipedia. another great resource is this robots.txt file generator.
where security is concerned, though, keep in mind that robots.txt is advisory only: well-behaved crawlers honor it, but the file itself is publicly readable and badly behaved bots are free to ignore it, so it’s no substitute for proper access controls on genuinely private content.
