I had two goals: to count AI crawlers DDoSing my nginx infrastructure and to see if anybody reads at least one of the three posts on my blog. To achieve both, I needed to gather data and transform it into meaningful insights, so basically I needed web analytics.
I don't think I need to explain why I wanted an ethical solution. If you are here, you likely have your own reasons. If you follow my work, you might also have some clues, but of course, you can always ask me for more.
In addition to ethical reasons, there are at least three more technical issues with convenient JS-based analytics.
- The setup is complicated: deploying a database and a backend, injecting JS into every response, and maintaining all of that.
- JS won't count people with ad-blockers, NoScript, or RSS, and I bet most of my readers use at least one of them.
- It won't count crawlers and bots that have limited JS evaluators or none at all.
Yes, without JS I won't be able to track the eye movement and body temperature of the reader.
The funny part is that I already have a lot of data for analytics in a web server's access.log. It's quite surprising how much useful information we can extract from it; we just need to provide a cute representation for the extracted info.
Luckily, there is GoAccess, a project which does exactly that. I could stop right here, and this post would already be useful, but I'll try to save you a few more hours of your life by covering its rough edges and sharing my tricks and findings.
A three-sentence introduction to GoAccess: it takes an arbitrary log file and generates an HTML dashboard with panels having various beautiful plots and tables (try the demo or search for goaccess screenshots). You can adjust its behavior with CLI options and persist them in a configuration file. The rest is done by tweaking ({grep,awk,sed}-ing) a log file.
cat ./access.log | goaccess - --config-file ./goaccess.conf
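If you prefer keeping the options in that configuration file instead of the command line, a minimal goaccess.conf could look roughly like the sketch below. It assumes the stock nginx "combined" log format; with a predefined format name, the explicit date and time formats are usually redundant, but they match what the stock configuration file suggests.

time-format %H:%M:%S
date-format %d/%b/%Y
log-format COMBINED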
We already have a lot in nginx's default log file: timings, referers, requests, user agents; however, one thing is missing. I have multiple domains served by my nginx server, and to distinguish requests to different hosts I enriched my nginx log with '$host:$server_port ' by setting log_format:
log_format vcombined '$host:$server_port '
                     '$remote_addr $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" '
                     '"$http_user_agent"';
access_log access.log vcombined;
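After changing the log format, validate the configuration and reload nginx however your system manages it; on a plain setup that is something like:

nginx -t && nginx -s reload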
The sample log entry below is from my click on the blog link (I adjusted indentation to mimic newlines from the configuration above, but in a real log, it's one line).
trop.in:443
171.225.184.136 - [31/May/2025:05:52:48 +0200]
"GET /blog HTTP/1.1" 200 1411
"https://trop.in/blog/modern-writers-block-or-how-to-blog"
"Mozilla/5.0 (X11; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0"
It's clear that to get to the blog page, I clicked the link in the Modern Writer's Block post and was using the trop.in host and the https port.
Now I can grep the log file by host[s] and select data for domains or sites I'm interested in. To parse the updated log file format, I added --log-format=vcombined to goaccess. I'll show a complete configuration at the end.
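For example, a report restricted to the https clearnet host can be produced roughly like this (the host and the output file name are just placeholders):

grep '^trop.in:443 ' access.log | goaccess - --log-format=vcombined -o trop.in.html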
Also, I was curious about how many people read a particular page from Yggdrasil Network and how many from the Clearnet, so I prepended host:port to the beginning of the URI in "$request" with awk '$7=$1$7' access.log; the resulting log entry looks like:
trop.in:443
171.225.184.136 - [31/May/2025:05:52:48 +0200]
"GET trop.in:443/blog HTTP/1.1" 200 1411
"https://trop.in/blog/modern-writers-block-or-how-to-blog"
"Mozilla/5.0 (X11; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0"
Thanks to this modification, I can build a separate report where I have two distinct entries, trop.in:443/blog and ygg.trop.in:80/blog, instead of one /blog. I don't use this report often, but I satisfied my curiosity.
After that, I realized that I rarely need information about all the hosts at once in the reports, so I decided to create a separate log for each server context.
access-in-trop-files.log
access-in-trop-genenetwork.log
access-in-trop-guix-ci.log
access-in-trop.log
access-local.log
access-wildcard.log
Logs for related hosts like ci.guix.trop.in and ci.guix.ygg.trop.in are grouped in access-in-trop-guix-ci.log, for trop.in and ygg.trop.in in access-in-trop.log, and the rest goes to wildcard.
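In nginx terms this is just a separate access_log per server context, roughly like the sketch below (the server names are from my setup, the rest is illustrative):

server {
    server_name trop.in ygg.trop.in;
    access_log logs/access-in-trop.log vcombined;
    # ...
}
server {
    server_name ci.guix.trop.in ci.guix.ygg.trop.in;
    access_log logs/access-in-trop-guix-ci.log vcombined;
    # ...
}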
Let's talk about the world map view. To understand the geography of readers and the ISPs of bots, I wanted the Geo Location and ASN panels. To make them work, you need geodatabase files with IP-to-location and IP-to-ASN mappings. I searched the internet for both GeoLite2-City.mmdb and GeoLite2-ASN.mmdb files, downloaded them to the server, and added them to goaccess's configuration.
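In the configuration file that becomes two geoip-database entries, something like the following (the paths are wherever you put the downloaded files):

geoip-database /srv/goaccess/GeoLite2-City.mmdb
geoip-database /srv/goaccess/GeoLite2-ASN.mmdb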
The last tweak is somewhat naughty, but I wanted real-time analytics, and there is a built-in option for it: goaccess can spawn a WebSocket server to constantly update the data on the dashboard. Of course, I don't want to expose it to the whole internet, so I made it listen only on localhost.
--real-time-html --host=localhost --port=17001
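The generated dashboard embeds a small client that connects back to this WebSocket for live updates, which is why the --ws-url option in the final command below has to point at an address the browser can actually reach.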
Now I need to expose both the generated HTML and the WebSocket to my laptop to access them conveniently. For this, I made a local server context in the nginx config:
server {
listen localhost:80;
access_log "logs/access-local.log" vcombined;
location /websocket/goaccess/in-trop {
proxy_pass http://localhost:17001;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
location / {
root /srv/nginx/local;
autoindex on;
}
}
and provided access to it from my laptop on http://localhost:8880 through an SSH tunnel.
ssh -N -L localhost:8880:localhost:80 pinky-ygg
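To avoid retyping the forwarding, the tunnel can also live in ~/.ssh/config (pinky-ygg is the alias of my server; substitute your own host):

Host pinky-ygg
    LocalForward localhost:8880 localhost:80

After that, a plain ssh -N pinky-ygg sets up the same tunnel.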
That's the whole setup; now it's time to run goaccess and enjoy the view. Here is the report I use for analytics on my primary site. It focuses on my flesh-and-blood readers, so I excluded Unknown and Crawlers user agents.
goaccess \
/var/run/nginx/logs/access-in-trop.log --log-format=vcombined \
-o /srv/nginx/local/analytics/trop.in.html \
--real-time-html --port=17001 --host=localhost \
--ws-url=localhost:8880/websocket/goaccess/in-trop \
--geoip-database=GeoLite2-ASN.mmdb --geoip-database=GeoLite2-City.mmdb \
--unknowns-as-crawlers --ignore-crawlers \
--enable-panel=REFERRERS
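With the tunnel above, the resulting dashboard should be reachable from the laptop at http://localhost:8880/analytics/trop.in.html, since -o writes it under the local server's root.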
Also, I have a bot-fighting report and am still playing with its parameters, but I already know where the bots come from, at what time, and what they are looking for. That means I accomplished both goals: I counted evil crawlers and the nicest hoomans.
One missing feature is the ability to slice and filter the data at runtime. I'd like to have an overview for an adjustable time frame and to filter out requests by regex interactively. However, from goaccess and the web server log, I already get more insights than I wished for, and much more than I could get from "convenient" web analytics.
Have any thoughts or comments? Publish them on the Fediverse and mention me, or publish them somewhere else on the internet and send me a link.