spiders.txt contribution

Steve Lionel
steve@stevelionel.com

The file catalog/includes/spiders.txt is used in conjunction with the "Prevent Spider Sessions" feature in Admin->Sessions.  When that feature is enabled, osC downcases the "user agent" from the HTTP request and looks to see if any of the strings in spiders.txt are found in it.  This is to identify search engine spiders (also called robots).  If there is a match, the spider is not assigned a PHP session, but can still access the page.  The benefits of this are:

- The search engines do not include session IDs in their index.
- The spiders are not able to add items to a "cart", so your database does not fill up with worthless abandoned carts.
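The check itself is a plain substring search against the lowercased user agent. osCommerce implements this in PHP; the following is a minimal Python sketch of the same idea (the function name and sample entries are illustrative, not part of osCommerce):

```python
def is_spider(user_agent, spider_strings):
    """Return True if any known spider substring occurs in the user agent.

    Mirrors the osCommerce check: the user agent is lowercased first,
    then each non-empty line of spiders.txt is tested as a plain substring.
    """
    ua = user_agent.lower()
    return any(s in ua for s in spider_strings if s)

# Entries as they might appear in spiders.txt
spiders = ["googlebot", "crawl", "spider"]
print(is_spider("Mozilla/5.0 (compatible; Googlebot/2.1)", spiders))  # True
print(is_spider("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0", spiders))  # False
```

Because every page load runs this loop over the whole file, a shorter spiders.txt directly means faster pages.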

This feature does NOT prevent spiders from indexing your site.  Rather, it improves the quality of the indexing.

osCommerce 2.2-MS2 was released in early 2003 and many new spiders have been launched since that time.  Keeping spiders.txt up to date is highly recommended.

This contribution contains a spiders.txt that has been updated with new spiders that I have seen visiting my sites.  It has also been streamlined somewhat, as the stock one has redundant and incorrect lines - these do no harm but make page loading slower.

I will keep updating this file as I identify new active spiders; I am not trying to identify every spider ever created.  A second file, spiders-large.txt, is included.  It adds spider strings supplied by ChrisW123, which I have optimized somewhat by removing redundant strings; I have not tried to validate every string in this file.  Use the "large" file if you want to further reduce the chance that a spider will start a session, at the cost of slower loading of every page.  If you use it, rename it to spiders.txt when you place it in your catalog/includes folder.
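The "redundant strings" mentioned above are entries that can never fire because a shorter entry already matches every user agent they would.  Pruning them can be automated; a sketch of that idea in Python (one entry per line assumed, function name my own):

```python
def remove_redundant(strings):
    """Drop any entry that contains another entry as a substring.

    If "ebot" is in the list, "googlebot" is redundant: every user
    agent containing "googlebot" also contains "ebot".
    """
    kept = []
    # Process shorter (more general) entries first so they are kept.
    for s in sorted(strings, key=len):
        if not any(k in s for k in kept):
            kept.append(s)
    return kept

print(remove_redundant(["googlebot", "ebot", "msnbot", "nbot"]))
# ['ebot', 'nbot'] -- 'googlebot' and 'msnbot' are redundant
```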

Please note that the strings here are not necessarily what spiders look for in robots.txt - a separate issue.

The purpose of this file is to tell whether an incoming request is from a spider, NOT to identify a particular spider.  Therefore, the list contains common substrings such as "spider", "crawl", and "obot" which match many different spiders; for example, "ebot" matches Googlebot and "nbot" matches msnbot.  The strings in this file MUST be all lowercase - because the user agent is lowercased before matching, an entry containing uppercase letters can never match and is effectively ignored.  If you think a particular spider is missing from the list, please post in the support topic and include a line from your access log showing the spider access, including the full user agent string.  Please do not update this contribution unless you fully understand how it works.

Comments or questions should go in the support topic at http://forums.oscommerce.com/index.php?showtopic=112609.  Please contact me if you have suggestions or changes for this list.

Released under the GNU General Public License

---
The following is a list of spiders NOT included, as they appear to be "link validators" and not actual search engine spiders:

linkalarm
validator
zealbot
zeus
feedchecker
del.icio.us (fetches images only)
pingdom (site monitor)