Not all bots are created equal. Some bots are good, some bots are bad, and some bots are not what they appear to be. This article discusses bots that attempt to exploit the special treatment sites give to search engine crawlers in the name of SEO.
We love search engines. If it weren’t for search engines most of our sites would never be discovered. When the search engine bot comes knocking, we best let it in, or we will suffer the consequences of not existing. Let’s face it, if you aren’t on the first page of the search results, you don’t exist. Let’s break down a request made by a search engine bot:
66.249.64.147 - - [24/Sep/2018:07:58:32 -0400] "GET / HTTP/1.1" 200 2947 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
We can identify this request as coming from Google’s indexing bot. We do this by examining the User Agent provided with the request. Unfortunately, this header can be set to any value and cannot be trusted. Even knowing this, it is common for site operators to always allow this user agent for fear of vanishing from the Internet.
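As a quick illustration, here is a minimal sketch in Python of how trivially the header can be forged; the target URL is a placeholder for a host you control, and the rest is just the standard library:

    import urllib.request

    # The User-Agent string Googlebot sends, copied from the log line above.
    GOOGLEBOT_UA = (
        "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 "
        "Mobile Safari/537.36 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    )

    # Any client can present this header; replace the URL with a host you control.
    request = urllib.request.Request(
        "http://example.com/", headers={"User-Agent": GOOGLEBOT_UA}
    )
    with urllib.request.urlopen(request) as response:
        print(response.status, response.getheader("Content-Type"))

To the server, this request is indistinguishable from Googlebot as long as the only thing it checks is the header.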
We are now presented with a problem. How do we know that an actor identifying itself as the Google indexer actually belongs to Google? Luckily, the popular search engines typically provide documentation on how to validate an actor. For example, Google documents this process at https://support.google.com/webmasters/answer/80553?hl=en. At this point you are probably wondering how this pattern applies to other search engine bots. The good news is that while the User Agents and valid domains may change, the process is still the same. It goes something like this:

1. Take the IP address of the connecting actor and perform a reverse DNS lookup on it.
2. Verify that the resulting hostname belongs to one of the search engine’s documented domains (for Google, googlebot.com or google.com).
3. Perform a forward DNS lookup on the hostname returned in the first step.
4. Verify that the forward lookup resolves back to the original IP address.
If all of these checks pass, the actor is a valid search engine bot. If any of them fails, it isn’t. We are now faced with the problem of validating any actor that claims to be a search engine indexer. Blindly allowing these actors introduces a technical weakness that could lead to a loss event. It is interesting to observe this trend in action. You can do so by searching your logs for actors that match a pattern and performing the verification steps manually. You will likely find a number of actors that fail the test. We have observed that even on low traffic sites, there are requests every day that fail validation.
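Here is a rough sketch of those verification steps in Python; the function name, the use of the standard socket library, and the Googlebot-only domain list are illustrative choices rather than anything prescribed by Google:

    import socket

    # Domains Google documents for its crawler; other engines publish their own.
    VALID_GOOGLE_DOMAINS = ("googlebot.com", "google.com")

    def verify_search_engine_ip(ip, valid_domains=VALID_GOOGLE_DOMAINS):
        try:
            # Step 1: reverse DNS lookup on the connecting IP address.
            hostname, _, _ = socket.gethostbyaddr(ip)
        except (socket.herror, socket.gaierror):
            return False

        # Step 2: the hostname must fall under a documented domain.
        if not hostname.endswith(tuple("." + domain for domain in valid_domains)):
            return False

        try:
            # Step 3: forward DNS lookup on the hostname from step 1.
            _, _, forward_ips = socket.gethostbyname_ex(hostname)
        except (socket.herror, socket.gaierror):
            return False

        # Step 4: the forward lookup must contain the original IP address.
        return ip in forward_ips

    # The IP from the log line above should pass; a forged actor will not.
    print(verify_search_engine_ip("66.249.64.147"))

Running it against a few suspect IP addresses pulled from your logs is a quick way to see the failures for yourself.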
We all know that doing this by hand doesn’t scale, and neither does manual IP blocking. As usual, this is a process that can be automated.
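As a sketch of what that automation might look like, the following scans an access log in combined format for actors that claim to be Googlebot and reuses the verify_search_engine_ip function sketched above; the log path and the regular expression are assumptions for illustration:

    import re

    # Combined log format: IP, identity, user, [time], "request",
    # status, bytes, "referer", "user agent".
    LOG_LINE = re.compile(
        r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"'
    )

    def failed_googlebot_claims(log_path):
        with open(log_path) as log:
            for line in log:
                match = LOG_LINE.match(line)
                if not match:
                    continue
                ip, user_agent = match.group(1), match.group(2)
                # Only actors claiming to be Googlebot need the DNS check.
                if "Googlebot" in user_agent and not verify_search_engine_ip(ip):
                    yield ip, user_agent

    for ip, user_agent in failed_googlebot_claims("/var/log/nginx/access.log"):
        print("failed validation:", ip, user_agent)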
It’s important to keep edge processing at the edge. This is a common mistake in application security. It’s typically easiest to take all application logic and put it in the application, which makes deployment and operations easier. The problem with this practice is that it allows requests to reach your application that never should have. This presents an opportunity for an actor to exploit a vulnerability in your application. It’s effectively a lack of input validation at a meta level. In general, if the request doesn’t look right and your application shouldn’t process it, it should be rejected before your application has to try.
Applying this idea leads us to an obvious choice: do it at the webserver layer. There are options here, but NGINX is one of the most widely used webservers available, and it has a nice API for creating custom modules and request handlers. Because this validation only looks at the User Agent and IP address of the request, it takes very little work to validate an actor and keep those that fail from ever hitting our application.
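Conceptually, the decision made at the edge for each request is small. The sketch below expresses it in Python using the verify_search_engine_ip helper from the earlier sketch; an NGINX module would implement the same idea in C inside the server:

    # Patterns that mark an actor as claiming to be a search engine crawler.
    BOT_PATTERNS = ("Googlebot",)  # extend with other crawlers as needed

    def edge_decision(ip, user_agent):
        """Return 403 for forged crawler claims, 200 to pass the request on."""
        claims_to_be_crawler = any(p in user_agent for p in BOT_PATTERNS)
        if claims_to_be_crawler and not verify_search_engine_ip(ip):
            # Reject here; the application never sees the request.
            return 403
        return 200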
I took some time recently to put this idea into an NGINX module. You can find the code at https://github.com/abedra/ngx_bot_verifier. It’s an open source project and all ideas, pull requests, and issues are welcome. The module handles the validation steps described and works for the following search engines:
Because the validation happens inline with the request and depends on responses from DNS, it can introduce perceived latency. To minimize this, the module is backed by a Redis cache. All validation results are cached so that subsequent requests don’t pay the DNS cost, and each cached result expires after a timeout configured in the module directives, measured from the last validation.
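The caching idea looks roughly like this sketch, which assumes the redis-py client, a made-up key scheme, an arbitrary one-hour expiry, and the verify_search_engine_ip function from earlier; the real module manages its cache internally and exposes the expiry through its configuration:

    import redis

    cache = redis.Redis(host="localhost", port=6379)
    CACHE_EXPIRY_SECONDS = 3600  # arbitrary; the module makes this configurable

    def cached_verify(ip):
        key = "bot-verifier:" + ip  # illustrative key scheme
        cached = cache.get(key)
        if cached is not None:
            return cached == b"valid"

        result = verify_search_engine_ip(ip)
        # Store the verdict so subsequent requests skip the DNS round trips.
        cache.setex(key, CACHE_EXPIRY_SECONDS, "valid" if result else "invalid")
        return result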
You’ve got a lot to do. The point of automating away threats like this is to reduce the noise and increase the signal of more sophisticated attacks. Automation heightens your awareness of the threats you face, shows you what types of attacks you are seeing day to day, and provides valuable information about what attackers are after. Data produced by these types of tools also feeds threat models and risk analysis that support your overall information security program.
This is just one of the many problems you face every day in the world of web application defense. There are many ways to solve it, and I encourage you to pick one and put it in place. Some search engines make validation difficult. For example, DuckDuckGo uses Amazon EC2 instances that don’t resolve back to DuckDuckGo and thus cannot follow the typical validation process. Other search engines present different challenges. If you decide to give this module a try and find it useful, have questions, or would like to see improvements, let us know. We believe we can make the world a little safer by sharing, and we hope this starts a broader conversation on handling bad robots.