A group of researchers from the University of California, Santa Barbara, and Northeastern University has created a system called PubCrawl for detecting Web crawlers, even when the automated bots come from a distributed collection of Internet addresses. The system combines multiple methods of discriminating between automated traffic and normal user requests, using both content and timing analysis to model traffic from a collection of IP addresses, the researchers stated in a paper to be presented at the USENIX Security Conference on Friday. Websites want to allow legitimate visitors to get the data they need from their pages while blocking wholesale scraping of content by competitors, attackers, and others who want to use the data for non-beneficial purposes, says Christopher Kruegel, an associate professor at UCSB and one of the authors of the paper.

Using data from a large, unnamed social network, the team trained the PubCrawl system to detect automated crawlers and then deployed it to block unwanted traffic on a production server. The researchers reported a high success rate: crawlers were positively identified more than 95% of the time, with perfect detection of unauthorized crawlers and nearly 99% recognition of crawlers masquerading as Web bots from a legitimate service.

A significant advance for crawler detection is recognizing the difference in traffic patterns between human visitors and Web bots, says Gregoire Jacob, a research scientist at UCSB and another co-author of the paper. By looking at the distribution of requests over time, the system can more accurately detect bots. When the researchers graphed a variety of traffic patterns, the differences became obvious, says Jacob.
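The paper's actual models are more sophisticated, but the core intuition behind timing analysis can be sketched simply: automated crawlers tend to space their requests regularly, while human browsing is bursty. A minimal illustration, assuming we summarize each source's inter-arrival times by their coefficient of variation (the feature name and thresholds here are illustrative, not PubCrawl's):

```python
from statistics import mean, stdev

def interarrival_features(timestamps):
    """Summarize request timing for one source address.

    Crawlers often show near-regular spacing between requests (low
    coefficient of variation); human sessions are bursty (high variation).
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mu = mean(gaps)
    cv = stdev(gaps) / mu if mu > 0 else 0.0  # coefficient of variation
    return {"mean_gap": mu, "cv": cv}

# A metronome-like bot versus a bursty human session (seconds):
bot = interarrival_features([0, 10, 20, 30, 40, 50])
human = interarrival_features([0, 1, 2, 30, 31, 90])
assert bot["cv"] < human["cv"]
```

Graphing such features over many sources is what made the bot/human separation visually obvious to the researchers.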

The researchers did not stop at using the timing patterns to improve the accuracy of their system. The team also tried to link similar patterns between disparate Internet sources that could indicate a distributed Web crawler. The PubCrawl system clusters Internet addresses that demonstrate similar traffic patterns into crawling campaigns. Such distributed networks are the main threat to any attempt to prevent content scraping. PubCrawl can be set to allow a certain number of "free" requests per Internet address -- under that limit, no request will be denied. Above that limit, the system attempts to identify the traffic pattern. Attacks that use a very large number of low-bandwidth sources could therefore escape notice.
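The free-request allowance described above amounts to a per-address gate in front of the classifier. A minimal sketch of that gating logic, assuming a simple request counter per IP (the class and limit are hypothetical, not from the paper):

```python
from collections import defaultdict

class RequestGate:
    """Per-address gate: below the free allowance every request passes;
    above it, traffic is handed off to pattern classification."""

    def __init__(self, limit=100):  # hypothetical free-request allowance
        self.limit = limit
        self.counts = defaultdict(int)

    def handle(self, ip):
        self.counts[ip] += 1
        if self.counts[ip] <= self.limit:
            return "allow"      # under the free quota, never denied
        return "classify"       # escalate to timing/content analysis

gate = RequestGate(limit=3)
decisions = [gate.handle("10.0.0.1") for _ in range(5)]
```

This structure also makes the stated weakness concrete: a crawler spread across enough addresses can keep each one under the limit and never reach the classification step.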

For traffic above the minimum threshold that does not match any known pattern, the PubCrawl system uses an active countermeasure, occasionally requiring the user to solve a CAPTCHA. Sources that request non-existent pages, fail to revisit pages, send odd referrer fields, or ignore cookies are flagged as automated crawlers much more quickly. Much of this is not new to the industry, says Matthew Prince, CEO of Cloudflare, a website availability and security service. Companies such as Incapsula, Akamai, and Cloudflare have already created techniques to find and classify Web crawlers.
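The behavioral signals listed above lend themselves to a simple scoring scheme. A rough sketch of how such flags might be combined, assuming illustrative session fields (these names and weights are assumptions, not PubCrawl's implementation):

```python
def crawler_score(session):
    """Count suspicious behavioral flags for one traffic source.

    Each signal below corresponds to a behavior named in the article;
    the field names and the 404-ratio threshold are illustrative.
    """
    score = 0
    if session.get("not_found_ratio", 0.0) > 0.2:  # many 404 requests
        score += 1
    if not session.get("revisits_pages", True):    # never revisits pages
        score += 1
    if session.get("referrer_anomaly", False):     # odd Referer values
        score += 1
    if not session.get("accepts_cookies", True):   # ignores cookies
        score += 1
    return score

suspicious = {"not_found_ratio": 0.4, "revisits_pages": False,
              "referrer_anomaly": True, "accepts_cookies": False}
benign = {"not_found_ratio": 0.01}
```

Sources with higher scores would then be escalated to countermeasures such as the CAPTCHA challenge sooner.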

Rival security firm Incapsula has noted the increase in automated Web traffic, which, in February, reached 51% of all traffic seen by websites. While 20% of Web requests come from search engine indexers and other good bots, 31% come from competitors' intelligence-gathering bots as well as site scrapers, comment spammers, and vulnerability scanners. With Web traffic set to increase five-fold by 2016, teasing out which traffic is good and which is bad will become more difficult, says Sumit Agarwal, vice president of product management for security start-up Shape Security.

Cross-posted from: Dark Reading