All websites on the internet are constantly crawled by bots. If you run a web app or server, bot traffic can cause a variety of problems, such as:
- Slow performance
- High bandwidth usage
- Security risks, as malicious bots probe for vulnerabilities
With recent advances in AI, there has been a noticeable increase in crawlers that scrape data from websites to train machine learning models.
However, not all bots are bad; some are necessary for your site. For example, search engines such as Google and Bing use bots to index your site. If you block these bots, your site won’t appear in search results.
With that being said, here are some steps you can take to block bad bots from your web apps and servers:
- Identify Bad Bots: The first step in blocking bad bots is identifying them. Tools such as Google Analytics and your server’s access logs can help you spot bot traffic. Look for suspicious or repetitive patterns, such as rapid bursts of requests from a single IP address or an unfamiliar user agent dominating the log (see the log-analysis sketch after this list).
- Use Robots.txt: The robots.txt file is a plain-text file that tells crawlers which pages or sections of your site they are allowed to crawl (see the example after this list). Note, however, that its contents are merely a suggestion: well-behaved crawlers honor it, but bad bots can ignore it without any consequences.
- Use .htaccess: The .htaccess file is a configuration file used by Apache web servers to control access to specific directories or files. You can use it to block specific IP addresses or user agents (see the sketch after this list). However, this method requires some technical knowledge and doesn’t scale well: you have to block IP addresses one by one, and scraping bots can regain access simply by rotating their IP addresses.
- Use a Firewall: A firewall can be an effective way to block bad bots. Most web application firewalls (WAFs) let you block traffic from specific IP addresses or user agents. RunCloud’s built-in Web Application Firewall, which is powered by the ModSecurity engine and the OWASP ModSecurity Core Rule Set, is a good solution for this (see the rule sketch below).
You can also use a cloud-based firewall service such as Cloudflare, which offers a wide range of tools for blocking bad bots. These services are a good solution because they block unwanted traffic before it even reaches your server.
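To get a feel for what log-based bot hunting looks like, here is a minimal Python sketch that parses a standard combined-format access log (the kind Apache and Nginx write by default) and reports the noisiest IP addresses and user agents. The log path is an assumption; adjust it for your server.

```python
import re
from collections import Counter

# Adjust this path for your server (assumption; Nginx default shown)
LOG_PATH = "/var/log/nginx/access.log"

# Minimal parser for the combined log format:
# IP - - [timestamp] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

ip_counts = Counter()
ua_counts = Counter()

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.match(line)
        if match:
            ip, user_agent = match.groups()
            ip_counts[ip] += 1
            ua_counts[user_agent] += 1

# Unusually chatty IPs and user agents are your first suspects
print("Top 10 IP addresses:")
for ip, count in ip_counts.most_common(10):
    print(f"{count:>8}  {ip}")

print("\nTop 10 user agents:")
for ua, count in ua_counts.most_common(10):
    print(f"{count:>8}  {ua}")
```

A single IP making thousands of requests, or an unfamiliar user agent dominating the log, is a strong signal of bot activity.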
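For the robots.txt approach, a file like the sketch below asks AI and data-scraping crawlers to stay away while leaving the rest of the site open to search engines. GPTBot and CCBot are the user agents used by OpenAI and Common Crawl; which bots you list is up to you, and remember that compliance is voluntary.

```
# Ask AI/data-scraping crawlers to stay out entirely
# (well-behaved bots honor this; bad bots may ignore it)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers, including search engines, may crawl
# everything except the admin area
User-agent: *
Disallow: /admin/
```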
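If you do go the .htaccess route on Apache, the sketch below shows both techniques: denying individual IP addresses and rejecting requests by user agent. The IP addresses and bot names are placeholders.

```apache
# Block specific IP addresses (Apache 2.4+ syntax; placeholder IPs)
<RequireAll>
    Require all granted
    Require not ip 203.0.113.45
    Require not ip 198.51.100.0/24
</RequireAll>

# Return 403 Forbidden to requests whose User-Agent matches
# known scrapers (placeholder names, case-insensitive match)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EvilScraper) [NC]
RewriteRule .* - [F,L]
```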
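Under the hood, WAF rules look something like the ModSecurity sketch below, which denies any request whose user agent contains a given string. The rule ID and bot name are placeholders; on RunCloud you would normally manage this through the dashboard rather than writing rules by hand.

```
# Deny requests whose User-Agent contains "BadBot" (placeholder)
SecRule REQUEST_HEADERS:User-Agent "@contains BadBot" \
    "id:1000001,phase:1,deny,status:403,log,msg:'Blocked bad bot'"
```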
How to Use the RunCloud Web Application Firewall
You can use RunCloud’s built-in firewall functionality to block unwanted traffic on your site. To enable this, go to your RunCloud dashboard and look for the “Firewall” tab in the side menu.
RunCloud manages the firewall through user-friendly settings. On the next screen, you can configure the Paranoia Level and Anomaly Threshold settings to your liking.
For most sites, we recommend starting with a low Paranoia Level and increasing it gradually as needed, so you don’t accidentally block legitimate visitors.
After you have configured the firewall, you can also add rules manually to block or allow visitors on your site. These rules can be based on cookie value, country of origin, IP address, user agent, and more.
Once you enable the firewall, any unwanted visitors will get a 403 Forbidden error when they try to visit your site.
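You can verify this behavior with a quick test. The Python sketch below sends a request with a deliberately suspicious user agent and checks for the 403 response; the URL and user agent string are placeholders for your own site and a pattern your firewall blocks.

```python
import urllib.error
import urllib.request

# Placeholders: your site's URL and a user agent your firewall blocks
URL = "https://example.com/"
BLOCKED_UA = "BadBot/1.0"

request = urllib.request.Request(URL, headers={"User-Agent": BLOCKED_UA})
try:
    with urllib.request.urlopen(request) as response:
        print(f"Status {response.status} -- the request was NOT blocked")
except urllib.error.HTTPError as err:
    if err.code == 403:
        print("403 Forbidden -- the firewall blocked the request as expected")
    else:
        print(f"Unexpected HTTP error: {err.code}")
```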
It’s important to monitor your site regularly for bot activity and adjust your blocking methods as needed. Some bots use tactics to evade detection, so stay up to date and tune your rules accordingly.
Read the following articles to learn more about the topic: