Robot regulation with LineRate

This article discusses a method for regulating HTTP accesses from robots (a.k.a. crawlers, spiders, or bots) using the F5 LineRate Precision Load Balancer.

A growing number of accesses from robots can degrade the performance of your web services. Studies show that robots accounted for 35% of accesses in 2005 [1], rising to 61.5% in 2013 [2]. Many sites employ the de facto Robots Exclusion Protocol standard [3] to regulate this access; however, it is an advisory mechanism, and not all robots follow it [4]. You can filter the disobedient robots on the web servers themselves, but that puts an extra burden on servers that are already heavily loaded. In this use-case scenario, we use LineRate scripting to exclude the known robots before they reach the backend servers.

The idea is simple. When a request hits LineRate, the script checks the HTTP User-Agent request header. If the value belongs to one of the known robots, LineRate sends a 403 Forbidden response back to the client without forwarding the request to the backend servers. Lists of known robots are available from a number of web sites. In this article, user-agent.org was chosen as the source because it provides an XML-formatted list. The list also contains legitimate user agents, so only the entries marked as ‘Robots’ or ‘Spam’ must be extracted.
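
The decision logic described above can be sketched in plain JavaScript, the language LineRate scripts are written in. This is only an illustration under stated assumptions, not the script used in this article: the entry objects, their field names (userAgent, type), the sample values, and the helper names buildBlocklist and decide are all hypothetical, and the sketch assumes the XML list has already been parsed and that an exact match on the User-Agent string is good enough.

```javascript
'use strict';

// Hypothetical, already-parsed entries. The field names and sample values
// are assumptions for illustration, not the real schema of the XML list.
const entries = [
  { userAgent: 'ExampleBot/1.0 (+http://bot.example/)', type: 'Robots' },
  { userAgent: 'ExampleHarvester/2.3', type: 'Spam' },
  { userAgent: 'ExampleBrowser/5.0', type: 'Browser' },
];

// Keep only the entries marked 'Robots' or 'Spam'; legitimate user agents
// in the list are left out of the blocklist.
function buildBlocklist(list) {
  const blocked = new Set();
  for (const entry of list) {
    if (entry.type === 'Robots' || entry.type === 'Spam') {
      blocked.add(entry.userAgent);
    }
  }
  return blocked;
}

// Decide what to do with a request given its headers (lower-cased keys,
// as Node.js presents them). Exact string matching is an assumption.
function decide(headers, blocked) {
  const ua = headers['user-agent'];
  if (ua !== undefined && blocked.has(ua)) {
    // Known robot: answer 403 Forbidden, do not forward to the backend.
    return { forward: false, status: 403 };
  }
  // Unknown or missing User-Agent: pass the request through.
  return { forward: true };
}

const blocklist = buildBlocklist(entries);
console.log(decide({ 'user-agent': 'ExampleHarvester/2.3' }, blocklist));
console.log(decide({ 'user-agent': 'ExampleBrowser/5.0' }, blocklist));
```

In an actual LineRate script, a function like decide would be called from the request handler, which would then either send the 403 response itself or hand the request on to the backend. Building the Set once at startup keeps the per-request cost to a single lookup.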