I had two goals: to count AI crawlers DDoSing my nginx infrastructure and to see if anybody reads at least one of the three posts on my blog. To achieve both, I needed to gather data and transform it into meaningful insights, so basically I needed web analytics.
I don't think I need to explain why I wanted an ethical solution. If you are here, you likely have your own reasons. If you follow my work, you might also have some clues, but of course, you can always ask me for more.
In addition to ethical reasons, there are at least three more technical issues with convenient JS-based analytics.
- The setup is complicated: deploying a database and a backend, injecting JS into every response, and maintaining all of that.
- JS won't count people with ad-blockers, NoScript, or RSS, and I bet most of my readers use at least one of them.
- It won't count crawlers and bots that have limited JS evaluators or none at all.
Yes, without JS I won't be able to track the eye movement and body temperature of the reader.
The funny part is that I already have a lot of data for analytics in a web server's access.log. It's quite surprising how much useful information we can extract from it; we just need to provide a cute representation for the extracted info.
Luckily, there is GoAccess, a project which does exactly that. I could stop right here, and this post would already be useful, but I'll try to save you a few more hours of your life by covering its rough edges and sharing my tricks and findings.
A three-sentence introduction to GoAccess: it takes an arbitrary log file and generates an HTML dashboard with panels having various beautiful plots and tables (try the demo or search for goaccess screenshots). You can adjust its behavior with CLI options and persist them in a configuration file. The rest is done by tweaking ({grep,awk,sed}-ing) a log file.
cat ./access.log | goaccess - --config-file ./goaccess.conf
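If you prefer keeping the options in that configuration file instead of the command line, a minimal goaccess.conf could look roughly like the sketch below. It assumes the stock nginx "combined" log format; with a predefined format name, the explicit date and time formats are usually redundant, but they match what the stock configuration file suggests.

time-format %H:%M:%S
date-format %d/%b/%Y
log-format COMBINED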
We already have a lot in nginx's default log file: timings, referers, requests, user agents; however, one thing is missing. I have multiple domains served by my nginx server, and to distinguish requests to different hosts I enriched my nginx log with '$host:$server_port ' by setting log_format:
log_format vcombined '$host:$server_port '
                     '$remote_addr $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" '
                     '"$http_user_agent"';
access_log access.log vcombined;
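After changing the log format, validate the configuration and reload nginx however your system manages it; on a plain setup that is something like:

nginx -t && nginx -s reload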
The sample log entry below is from my click on the blog link (I adjusted indentation to mimic newlines from the configuration above, but in a real log, it's one line).
trop.in:443
171.225.184.136 - [31/May/2025:05:52:48 +0200]
"GET /blog HTTP/1.1" 200 1411
"https://trop.in/blog/modern-writers-block-or-how-to-blog"
"Mozilla/5.0 (X11; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0"
It's clear that to get to the blog page, I clicked the link in the Modern Writer's Block post and was using the trop.in host and the https port.
Now I can grep the log file by host[s] and select data for domains or sites I'm interested in. To parse the updated log file format, I added --log-format=vcombined to goaccess. I'll show a complete configuration at the end.
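For example, a report restricted to the https clearnet host can be produced roughly like this (the host and the output file name are just placeholders):

grep '^trop.in:443 ' access.log | goaccess - --log-format=vcombined -o trop.in.html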
Also, I was curious about how many people read a particular page from Yggdrasil Network and how many from the Clearnet, so I prepended host:port to the beginning of the URI in "$request" with awk '$7=$1$7' access.log; the resulting log entry looks like:
trop.in:443
171.225.184.136 - [31/May/2025:05:52:48 +0200]
"GET trop.in:443/blog HTTP/1.1" 200 1411
"https://trop.in/blog/modern-writers-block-or-how-to-blog"
"Mozilla/5.0 (X11; Linux x86_64; rv:136.0) Gecko/20100101 Firefox/136.0"
Thanks to this modification, I can build a separate report where I have two distinct entries, trop.in:443/blog and ygg.trop.in:80/blog, instead of one /blog. I don't use this report often, but I satisfied my curiosity.
After that, I realized that I rarely need information about all the hosts at once in the reports, so I decided to create a separate log for each server context.
access-in-trop-files.log
access-in-trop-genenetwork.log
access-in-trop-guix-ci.log
access-in-trop.log
access-local.log
access-wildcard.log
Logs for related hosts like ci.guix.trop.in and ci.guix.ygg.trop.in are grouped in access-in-trop-guix-ci.log, for trop.in and ygg.trop.in in access-in-trop.log, and the rest goes to wildcard.
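In nginx terms this is just a separate access_log per server context, roughly like the sketch below (the server names are from my setup, the rest is illustrative):

server {
    server_name trop.in ygg.trop.in;
    access_log logs/access-in-trop.log vcombined;
    # ...
}
server {
    server_name ci.guix.trop.in ci.guix.ygg.trop.in;
    access_log logs/access-in-trop-guix-ci.log vcombined;
    # ...
}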
Let's talk about the world map view. To understand the geography of readers and the ISPs of bots, I wanted the Geo Location and ASN panels. To make them work, you need geodatabase files with IP-to-location and IP-to-ASN mappings. I searched the internet for both GeoLite2-City.mmdb and GeoLite2-ASN.mmdb files, downloaded them to the server, and added them to goaccess's configuration.
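In the configuration file that becomes two geoip-database entries, something like the following (the paths are wherever you put the downloaded files):

geoip-database /srv/goaccess/GeoLite2-City.mmdb
geoip-database /srv/goaccess/GeoLite2-ASN.mmdb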
The last tweak is somewhat naughty, but I wanted real-time analytics, and there is a built-in option for it: goaccess can spawn a WebSocket server to constantly update the data on the dashboard. Of course, I don't want to expose it to the whole internet, so I made it listen only on localhost.
--real-time-html --host=localhost --port=17001
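The generated dashboard embeds a small client that connects back to this WebSocket for live updates, which is why the --ws-url option in the final command below has to point at an address the browser can actually reach.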
Now I need to expose both the generated HTML and the WebSocket to my laptop to access them conveniently. For this, I made a local server context in the nginx config:
server {
listen localhost:80;
access_log "logs/access-local.log" vcombined;
location /websocket/goaccess/in-trop {
proxy_pass http://localhost:17001;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
}
location / {
root /srv/nginx/local;
autoindex on;
}
}
and provided access to it from my laptop on http://localhost:8880 through an SSH tunnel.
ssh -N -L localhost:8880:localhost:80 pinky-ygg
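To avoid retyping the forwarding, the tunnel can also live in ~/.ssh/config (pinky-ygg is the alias of my server; substitute your own host):

Host pinky-ygg
    LocalForward localhost:8880 localhost:80

After that, a plain ssh -N pinky-ygg sets up the same tunnel.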
That's the whole setup; now it's time to run goaccess and enjoy the view. Here is the report I use for analytics on my primary site. It focuses on my flesh-and-blood readers, so I excluded Unknown and Crawlers user agents.
goaccess \
/var/run/nginx/logs/access-in-trop.log --log-format=vcombined \
-o /srv/nginx/local/analytics/trop.in.html \
--real-time-html --port=17001 --host=localhost \
--ws-url=localhost:8880/websocket/goaccess/in-trop \
--geoip-database=GeoLite2-ASN.mmdb --geoip-database=GeoLite2-City.mmdb \
--unknowns-as-crawlers --ignore-crawlers \
--enable-panel=REFERRERS
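With the tunnel above, the resulting dashboard should be reachable from the laptop at http://localhost:8880/analytics/trop.in.html, since -o writes it under the local server's root.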
Also, I have a bot-fighting report and am still playing with its parameters, but I already know where the bots come from, at what time, and what they are looking for. That means I accomplished both goals: I counted evil crawlers and the nicest hoomans.
One missing feature is the ability to slice and filter the data at runtime. I'd like to have an overview for an adjustable time frame and to filter out requests by regex interactively. However, from goaccess and the web server log, I already get more insights than I wished for, and much more than I could get from "convenient" web analytics.
Have any thoughts or comments? Publish them on the Fediverse and mention me, or publish them somewhere else on the internet and send me a link.