{"id":27,"date":"2006-09-19T11:24:20","date_gmt":"2006-09-19T11:24:20","guid":{"rendered":"http:\/\/ramblingsofasysadmin.com\/blog\/?p=27"},"modified":"2006-09-19T11:24:20","modified_gmt":"2006-09-19T11:24:20","slug":"robots-txt-and-spidering","status":"publish","type":"post","link":"https:\/\/ramblingsofasysadmin.com\/?p=27","title":{"rendered":"robots.txt and spidering."},"content":{"rendered":"<p>when you have content that is not for public consumption, it&#8217;s better to be safe than sorry and keep the search engines from crawling (or spidering) the pages and learning your link structure.  for example, in a development environment, it would hardly be useful for the site to show up in search results as if it were public when it&#8217;s not ready yet.<br \/>\nenter <font face=\"courier\">robots.txt<\/font>.  this file is extremely important; search engines look for that file to determine whether the site can be entered into their search cache or whether you want to keep it private.<br \/>\nthe basic <font face=\"courier\">robots.txt<\/font> file works like this: you put the file in the root of your website (e.g. the <font face=\"courier\">public_html<\/font> or <font face=\"courier\">httpdocs<\/font> folder).  it won&#8217;t work if it&#8217;s located anywhere else, including a subdirectory of the site.<br \/>\nthe crux of <font face=\"courier\">robots.txt<\/font> is the <font face=\"courier\">User-agent<\/font> and <font face=\"courier\">Disallow<\/font> directives.  if you don&#8217;t want <b>any<\/b> search engine bots to spider any files on your site, the basic file looks like this:<br \/>\n<font face=\"courier\">User-agent: *<br \/>\nDisallow: \/<\/font><br \/>\nhowever, if you don&#8217;t want the search engines to crawl a specific folder, e.g. 
www.yoursite.com\/private, you would create the file like so:<br \/>\n<font face=\"courier\">User-agent: *<br \/>\nDisallow: \/private\/<\/font><br \/>\nif you don&#8217;t want <b>google<\/b> to spider a specific folder called \/newsletters\/, you would use the following:<br \/>\n<font face=\"courier\">User-agent: googlebot<br \/>\nDisallow: \/newsletters\/<\/font><br \/>\nthere are hundreds of bots that you could consider, but the main ones are probably google (googlebot), yahoo (slurp), and msn (msnbot).<br \/>\nyou can also target multiple user-agents in a single robots.txt file, which looks like this:<br \/>\n<font face=\"courier\">User-agent: *<br \/>\nDisallow: \/<br \/>\n<br \/>\nUser-agent: googlebot<br \/>\nDisallow: \/cgi-bin\/<br \/>\nDisallow: \/private\/<\/font><br \/>\na bot follows only the most specific group that matches it, so here googlebot may crawl everything except \/cgi-bin\/ and \/private\/, while every other bot is kept out of the whole site.<br \/>\nthere&#8217;s a great reference on user agents <a href=\"http:\/\/en.wikipedia.org\/wiki\/User_agent\" target=\"_new\">on wikipedia<\/a>.  another great resource is <a href=\"http:\/\/www.mcanerin.com\/EN\/search-engine\/robots-txt.asp\">this robots.txt file generator<\/a>.<br \/>\nwhere keeping unfinished or private pages out of search results is concerned, a <font face=\"courier\">robots.txt<\/font> file makes a huge difference.  keep in mind that it&#8217;s only a request that well-behaved bots honor, and anyone can read the file, so truly sensitive content still needs password protection.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>when you have content that is not for public consumption, it&#8217;s better to be safe than sorry and keep the search engines from crawling (or spidering) the pages and learning your link structure. for example, in a development environment, it would hardly be useful for the site to show up in search results as if it were public when it&#8217;s not ready yet. enter robots.txt. 
this file is extremely important; search engines look for that file and <span class=\"ellipsis\">&hellip;<\/span> <span class=\"more-link-wrap\"><a href=\"https:\/\/ramblingsofasysadmin.com\/?p=27\" class=\"more-link\"><span>Read More &rarr;<\/span><\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-27","post","type-post","status-publish","format-standard","hentry","category-tutorials"],"_links":{"self":[{"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=\/wp\/v2\/posts\/27","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=27"}],"version-history":[{"count":0,"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=\/wp\/v2\/posts\/27\/revisions"}],"wp:attachment":[{"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=27"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=27"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ramblingsofasysadmin.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=27"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}