Block AI web-scrapers from stealing your website content

Did you know that you can block AI-related web scrapers from downloading your whole website and effectively stealing your content? That way, LLMs will have to find a different data source for their training!

Why, you may ask? First of all, AI companies make money on their LLMs, so using your content without paying you is simply theft. That applies to texts, images and sounds alike: it is intellectual property with a certain value. A long time ago I put an “Attribution-NonCommercial-NoDerivatives” license on my website and guess what… it does not matter. I did not receive any attribution. Dozens of various bots visit my website and just download all the content. So I decided…

… to block those AI-related web-crawling and web-scraping bots. And no, not by modifying the robots.txt file (or any XML sitemaps), as that may not be sufficient for some Chinese bots which simply “don’t give a damn”. Nor did I want to use any plugins or server extensions. I decided to go the hard way:

location / {
  if ($http_user_agent ~* "Bytespider") { return 403; }
  ...
}

And then decide exactly which HTTP User-Agents (the client “browser”, in other words) I would like to show the middle finger to. For those who do not stare at server logs for at least a few minutes a day: “Bytespider” is a scraping bot from ByteDance, the company that owns TikTok. It is said that this bot possibly downloads content to feed some Chinese LLM. Chinese or US, it does not really matter: if you would like to use my content, either pay me or attribute your usage of it. How, you may ask? To be honest, I do not know.
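
Once the list of unwanted crawlers grows, a map in the http context keeps the location block tidy. A minimal sketch; apart from Bytespider, the User-Agent strings below (GPTBot, CCBot, ClaudeBot) are just examples of well-known AI crawlers, so adjust the list to whatever shows up in your own logs:

# http context: flag requests coming from AI crawlers
map $http_user_agent $blocked_ai_bot {
    default       0;
    ~*Bytespider  1;
    ~*GPTBot      1;
    ~*CCBot       1;
    ~*ClaudeBot   1;
}

# inside the existing server block
location / {
    if ($blocked_ai_bot) { return 403; }
}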

There is either the hard way (blocking certain User-Agents in NGINX, as above) or the diplomatic way, which could lead to a catalogue of websites that do not want to take part in the AI feeding process for free. I think there are many more content creators out there who would like to get a piece of the AI birthday cake…

Conditional Nginx logging

Logging all HTTP traffic is often unnecessary. That is especially true for websites that serve not only text content but also all kinds of additional assets: JavaScript, stylesheets, images, fonts and so on. You could select what to log inclusively, but it is much easier to do it by conditional negative selection. First define a log format, then create a conditional mapping, and finally configure the logger with the decision variable. For instance:

log_format mylogformat '$remote_addr - $remote_user [$time_local] '
                       '"$request" $status $body_bytes_sent '
                       '"$http_referer" "$http_user_agent" "$gzip_ratio"';

map $request_uri $loggable {
    default                                             1;
    ~*\.(ico|css|js|gif|jpg|jpeg|png|svg|woff|ttf|eot)$ 0;
}

access_log /var/log/nginx/access.log mylogformat if=$loggable;

This way none of those additional assets are logged and only the regular pages end up in the log, which is far more useful for further traffic analysis than filtering such entries out manually.

Geo location with Filebeat on Elasticsearch 7, HAProxy and NGINX

Display a geo location map for NGINX traffic logs in Kibana

Summary

There are three things to remember and configure in order to get the geo location map working:

  • Use the “forwardfor” option on the pfSense HAProxy TLS frontend
  • Enable the Filebeat NGINX module and point it at the relevant log files
  • Define a custom NGINX log format

This guide assumes an Ubuntu Linux setup.

Elasticsearch 7

First install Elasticsearch 7 as follows.

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
sudo apt-get install apt-transport-https
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list
sudo apt-get update && sudo apt-get install elasticsearch

Note: for a more resilient setup, install more than one Elasticsearch server node and enable basic security. For the sake of clarity I skip these two aspects here; they will be covered in another article.

Kibana

Then install Kibana to have a UI for Elasticsearch; if it runs on the same host, the repository is already in place and only the install command is needed:

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
sudo apt-get install apt-transport-https
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list
sudo apt-get update && sudo apt-get install kibana

Configuration

Now be sure to configure both Elasticsearch and Kibana. For Elasticsearch the configuration file is /etc/elasticsearch/elasticsearch.yml. Be sure to set the following (replace 0.0.0.0 with your local IP address):

network.host:   0.0.0.0
http.port:      9200
discovery.type: single-node

To enable and start the Elasticsearch server:

sudo systemctl enable elasticsearch
sudo service elasticsearch start

Now configure Kibana. The configuration file is /etc/kibana/kibana.yml. Again, replace “0.0.0.0” with your local IP address:

server.port: 5601
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://0.0.0.0:9200"]

Enable and start the Kibana server:

sudo systemctl enable kibana
sudo service kibana start

Be sure to check that both Elasticsearch and Kibana are up and running with service or systemctl, as shown below. If everything is fine, proceed to set up Filebeat and Packetbeat on the client hosts; there is a separate guide covering that. In the Beats configuration files, point to the Elasticsearch server you have just installed.
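
A quick sanity check could look like this (addresses as configured above; adjust them to your local IP):

systemctl status elasticsearch kibana
curl http://0.0.0.0:9200       # should return a JSON banner with cluster information
curl -I http://0.0.0.0:5601    # Kibana should answer with an HTTP status line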

HAProxy on pfSense

The example setup includes a gateway, which is pfSense with the HAProxy package. Configure two frontends: one for HTTP on port 80 and one for HTTPS on port 443. On the HTTP frontend configure:

Action http-request redirect
rule: scheme https

On the HTTPS frontend configure the certificate and check SSL Offloading. Of course, you first need to load that certificate under System, Cert. Manager. Configure your backends and select the appropriate one on the HTTPS frontend. Then, still on the HTTPS frontend, check the following:

Use "forwardfor" option

This is required to be able to read the client IP in the backend NGINX.
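
For reference, outside of the pfSense GUI the same settings correspond roughly to the following raw haproxy.cfg directives (frontend names, certificate path and backend address are made up for this sketch):

frontend http-in
    bind :80
    http-request redirect scheme https

frontend https-in
    bind :443 ssl crt /etc/haproxy/your-cert.pem
    option forwardfor
    default_backend nginx-backend

backend nginx-backend
    server web1 192.168.1.10:80 check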

Filebeat on client host

On the client hosts install the filebeat package; there is a separate guide for that. Edit the configuration file, which is /etc/filebeat/filebeat.yml, and disable the default input:

filebeat.inputs:
- type: log
  enabled: false

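The same file also needs an output section pointing at the Elasticsearch server installed earlier (the address below is an assumption; use your local IP):

output.elasticsearch:
  hosts: ["http://0.0.0.0:9200"]
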
Then enable the NGINX module and run the setup:

filebeat modules enable nginx
filebeat setup

Now you are almost ready to deliver log files, but first you need to point to them in the module configuration at /etc/filebeat/modules.d/nginx.yml:

- module: nginx
  access:
    enabled: true
    var.paths: ["/var/log/nginx/app.log"]

You can now enable and start filebeat:

sudo systemctl enable filebeat
sudo service filebeat start
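
Before relying on it, it is worth letting Filebeat validate its own configuration and the connection to Elasticsearch:

sudo filebeat test config
sudo filebeat test output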

NGINX

In the http section of the configuration file /etc/nginx/nginx.conf add a new log format:

log_format mydefault '$http_x_forwarded_for - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent"';

You can use it in the application configuration file /etc/nginx/conf.d/app.conf, in the server stanza:

access_log  /var/log/nginx/app.log mydefault;
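
For context, a minimal server stanza could look like this (the server name and web root are hypothetical; TLS is already terminated on HAProxy, so the backend listens on plain HTTP):

server {
    listen 80;
    server_name app.example.com;                  # hypothetical name
    access_log  /var/log/nginx/app.log mydefault;

    location / {
        root  /var/www/app;                       # or proxy_pass to your application
        index index.html;
    }
}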

Restart your NGINX server and go to Kibana to explore your data. This custom log format is needed to capture the real client IP, which arrives in the $http_x_forwarded_for variable once HAProxy adds it; apart from that, it stays as close as possible to the default format.

Network geo location map

If everything went fine, that is installing Elasticsearch, Kibana and the Beats on your client hosts and configuring HAProxy and NGINX, then you can open the Security, Explore, Network section and hit the refresh button to load data into the map. But first you need to select Data sources (a link above the map, on the right side) and include the filebeat-* index pattern.

Fig. Filebeat data source selection

With such configuration you should be able to see geo points representing client locations.

Fig. Location points based on filebeat data

If you also enable Packetbeat, you will additionally see network information data below the map.

Fig. Networking information data

Please remember that enabling Packetbeat generates tons of data in complex environments, so be sure to allocate enough space for these indices.

Summary

This guide covered the basic path of installing Elasticsearch, Kibana and Beats, and of configuring HAProxy and NGINX to deliver traffic logs to Elasticsearch, so that the traffic can be visualized as geo location points.

WordPress quirks and features

I will start with a weird experience with one of the WP themes, Polite and Polite Grid. I was wondering why my website made two requests on every page view: one for the document and another for the content. This was annoying, as I was unable to measure traffic properly. It turned out to be caused by the theme I had been using for some time; switching to a different one fixed it.

Secondly, to make the NGINX logs easier to handle, I created separate location entries for all the WP internals, so the “real” traffic goes to one particular log and everything else goes somewhere else. Take a look:

# WordPress internals (wp-admin, wp-content, wp-json, ...)
location ~ ^/wp- {
  access_log /var/log/nginx/michalasobczak.pl.wordpress.access.log;
  ...
}
location ~ ^/favicon {
   ...
}
# the "real" traffic
location / {
  access_log /var/log/nginx/michalasobczak.pl.access.log;
  ...
}

I can now push these logs into Elasticsearch through the Filebeat module and no longer have to bother with all the WP-specific entries that clutter the NGINX logs.
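
For example, the Filebeat NGINX module from the earlier guide can then be pointed only at the “real” traffic log:

- module: nginx
  access:
    enabled: true
    var.paths: ["/var/log/nginx/michalasobczak.pl.access.log"]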

min.io server behind NGINX reverse-proxy

The most recent min.io server release requires one additional thing in the configuration compared to releases from past years. Having min.io on one box and NGINX on another requires setting up a reverse proxy, which is a straightforward operation. You need to remember to add the proper headers to pass the hostname and scheme to the min.io box. The whole thing is described in the documentation.

proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Host $http_host;

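In context, the proxy configuration could look roughly like this (the upstream address is an assumption; unlimited body size and disabled buffering are commonly recommended for object storage behind a proxy):

location / {
    client_max_body_size 0;                  # allow large object uploads
    proxy_buffering off;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Host $http_host;
    proxy_pass http://192.168.1.20:9000;     # assumed address of the min.io box
}
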
But… you are required to put the following into a min.io configuration file:

MINIO_SERVER_URL="https://your.url/"

This should be written in bold letters, because without it you can upload artifacts into buckets but will not be able to download them via a share link, due to a checksum error: the checksum depends on the domain name. The default min.io installation offers the console and the API on one port only in theory; if you go for the console, you are redirected to some temporary port, valid only until the next restart of the server. So the API will work flawlessly on a public domain, but the console in the default installation will not. Please keep that in mind.
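
A possible workaround, assuming the packaged systemd service which reads its environment from /etc/default/minio, is to pin the console to a fixed port as well (the port number below is just an example) and expose or proxy that port explicitly:

MINIO_SERVER_URL="https://your.url/"
MINIO_OPTS="--console-address :9001"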