You could write a script that examines an Apache web server log file and, based on defined criteria, determines whether robots are spidering the site without a user agent that identifies them as bots.
You could write the IP addresses to a file along with the probability that each one is a bot, based on a scoring scheme.
You could do the classification with Bayes' theorem and encode that into your methodology.
Then, when you're finished, you can post it here.
We will use it on our log files!