Introduction

A static site generator (SSG) is a program that accepts text files as input and produces static web pages as output. It can be useful in various scenarios: building a blog, a book, documentation, a project page, or a personal page, for example.

There are a few SSGs written in Scheme available in the wild, namely Haunt, Skribilo, and Hyde, plus two more for Racket: Pollen and Frog, which are outside the Guile ecosystem but still quite close and can serve as a source of inspiration.

Hyde doesn't seem to be maintained, but the source code is still available, and there is even an attempt to take it further and reincarnate the project.

Skribilo is a document production toolkit capable of much more: it provides a lot of functionality outside the SSG scope, so we don't cover it in this writing.

That leaves only one option at the moment: Haunt, and the further discussion will revolve around it. But before exploring it, we need to establish common ground and cover the topic of markup languages.

Markup Languages

Markup languages are used for defining document structure, formatting, and relationships between its parts. They play an important role in SSGs, and different languages suit different tasks: simple and expressive ones for human convenience, powerful and flexible ones for intermediate representation and manipulation, compatible and widespread ones for distribution.

SGML, XML, HTML, XHTML

This is probably the most widespread family of markup languages, currently used all over the web. SSGs usually (though not always) produce HTML or XHTML documents as output. It is also good to know the relationships between these languages to understand some technical issues we will face later.

SGML (Standard Generalized Markup Language) appeared in 1986 and heavily influenced both HTML and XML.

XML (Extensible Markup Language) is a metalanguage that allows the creation of new languages (like XHTML). It was originally developed as a simplification of SGML with a rigid syntax that leaves no room for confusion, and it is itself defined with an SGML doctype. XML is often used for representing and exchanging data.

HTML is a more user-friendly markup language: it's defined in plain English and has more forgiving parsers and interpreters, allowing things like uppercase tags or tags without a matching closing tag. Such flexibility can be convenient for users, but it makes documents harder to operate on programmatically (parse, process, and serialize).

XHTML (an XML serialization of HTML) is a version of HTML that conforms to the XML grammar. XHTML documents can be processed with XML parsers, and tools for querying and transformation should work as well.

While both HTML and XML are influenced by SGML, there is no direct relationship between them, and tools for XML can't be used for HTML in the general case.

Lightweight Markup Languages

This is another family of markup languages, which are simpler, less verbose, and generally more human-oriented. Notable members are wiki markup, Markdown, Org-mode, reStructuredText, BBCode, and AsciiDoc.

SSGs often use these languages to represent the content of pages, posts, etc. This content is later combined with other parts and templates to produce the final output, usually in the form of (X)HTML documents.

Other Markup Languages

There are a number of languages and typesetting systems not covered by the previous two sections: Texinfo, LaTeX, Skribe, Hiccup, SXML. Their goals vary: preparing hardcopies, serving as an intermediate format, or better suiting specific needs like writing documentation.

Haunt Overview

Haunt is a simple and hackable SSG written in Scheme that tries to apply functional programming ideas. The usual approach for building a site with it: prepare SXML page templates; read the content from HTML, Markdown, or other markup files and convert it to SXML; insert the content into the templates; and serialize the resulting pages to HTML. Let's discuss the various parts of this process in more detail.

SXML

SXML is a representation of XML using S-expressions: lists, symbols, and strings. It can be less verbose than the original representation and is much easier to work with in Scheme.
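
For example, a link written in XML and its SXML equivalent:

<a href="https://example.org">link</a>

(a (@ (href "https://example.org")) "link")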

SXML is used in Haunt as an intermediate format for pages and their parts; it is relatively easy to process, manipulate, and later serialize to target formats like XHTML. It can be written manually as s-expressions in Scheme code, generated programmatically, or produced with a mix of both. It looks like this:

(define n 3)

(define slide-content
  (get-html-part "./slide3.html" "body>div.content"))

(define (sxml-slide n slide-content)
  `((h2 ,(format #f "Slide number: ~a" n))
    (div (@ (class "slide-content"))
         ,slide-content
         (p "the additional text of the slide"))))))

As mentioned earlier, there is no direct relationship between XML and HTML, and while we can usually parse arbitrary HTML and convert it to SXML without losing significant information, we can't directly use XML parsers for that. For example, this HTML is not valid XML:

<input type="checkbox" checked />

Luckily, we can write boolean attributes in full form, like checked="checked", which is valid in both HTML and XML.
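
The example above then becomes:

<input type="checkbox" checked="checked" />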

Most lightweight markup languages, as well as SSGs, usually target HTML, but an SSG needs to combine content, templates, and data from various sources and merge them together, so SXML looks like a solid choice for an intermediate representation.

The Transformation Workflow

Each site page is built by a series of successively applied transformations. A transformation is basically a function that accepts some metadata and data and returns other data (usually SXML) and sometimes additional metadata. Because transformations are pure functions, several of them can be composed into one bigger transformation.
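
As a minimal sketch of this idea (both transformations below are hypothetical), a transformation can be modeled as a function over a (metadata . data) pair, and then ordinary function composition applies:

(define (wrap-article page)            ; hypothetical transformation
  (cons (car page)                     ; metadata passes through untouched
        `(article ,(cdr page))))       ; wrap data into an article element

(define (add-footer page)              ; another hypothetical transformation
  (cons (car page)
        `(div ,(cdr page) (footer "generated by an SSG"))))

;; compose is standard Guile: two transformations become one bigger one
(define article-with-footer (compose add-footer wrap-article))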

We will cover this in more detail in the next section, but readers, templates, layouts, serializers, and builders are all just transformations. For example, the top-level template, called a layout, simply produces SXML for the final page, which can then be serialized to the target format. To demonstrate the workflow, we will take a bottom-up approach.

Let's take a simple Markdown file, in which one writes the content of a blog post in a human-friendly markup language, and add metadata at the top of the file: title, publish date, tags.

title: Hello, CommonMark!
date: 2023-05-09 12:00
tags: markdown, commonmark
---

## This is a CommonMark post

CommonMark is a **strongly** defined, *highly* compatible
specification of Markdown. Learn more about CommonMark
[here](http://commonmark.org/).

This file can be parsed into metadata (an alist) plus data (SXML):

=> ((tags "markdown" "commonmark")
    (date . #<date nanosecond: 0 second: 0 minute: 0 hour: 12 day: 9 month: 5 year: 2023 zone-offset: 14400>)
    (title . "Hello, CommonMark!"))
=> ((h2 "This is a CommonMark post")
    (p "CommonMark is a " (strong "strongly") " defined, "
       (em "highly") " compatible" "\n"
       "specification of Markdown, learn more about CommomMark" "\n"
       (a (@ (href "http://commonmark.org/")) "here") "."))

A metadata+data pair representing one post is a good unit of operation. With one more transformation (it can be just a template: a function adding html, head, body tags, and a few more minor things), the SSG can produce SXML that is almost ready for serialization. After deciding on the resulting file name and performing the serialization step, the final HTML is produced.
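
A minimal sketch of such a template transformation (the names here are illustrative, not Haunt's API):

(define (page-template title body)
  ;; wrap the post SXML into an almost serialization-ready page
  `((doctype "html")
    (html
     (head (title ,title))
     (body ,@body))))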

Some additional transformations can be desirable, for example, rewriting relative links to source markup files in the generated HTML files. Overall, such steps fit this general transformation workflow well.
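
For instance, such a link-rewriting step could be a small recursive function over SXML. A rough sketch (assuming, for brevity, that href is the first attribute of a link):

(use-modules (ice-9 match))

(define (rewrite-md-links sxml)
  (match sxml
    ;; rewrite anchors pointing at source .md files
    (('a ('@ ('href url) attrs ...) body ...)
     (let ((url (if (string-suffix? ".md" url)
                    (string-append (string-drop-right url 3) ".html")
                    url)))
       `(a (@ (href ,url) ,@attrs) ,@(map rewrite-md-links body))))
    ;; descend into other nodes, leave leaves as-is
    ((nodes ...) (map rewrite-md-links sxml))
    (leaf leaf)))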

Let's zoom out a little and look at the directory structure rather than a single file. Usually, SSGs operate on a number of files and, in addition to simple pages, generate composite pages like lists of articles, RSS feeds, etc. For this purpose, the unit of operation becomes a list of metadata+data objects: instead of parsing one markup file, the SSG traverses the whole directory and generates a list of objects for future transformations. The overall idea stays the same, but now many input files can produce many output files, or be merged into only a few or even a single output file (a feed, for example).
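
In Haunt's terms, this could look like mapping read-post over the files of a directory. A rough sketch (a simplified version of what Haunt does internally):

(use-modules (haunt reader)
             (haunt reader commonmark)
             (ice-9 ftw))

;; produce a list of post objects (metadata + SXML), one per markdown file
(define (read-directory dir)
  (map (lambda (file)
         (read-post commonmark-reader (string-append dir "/" file) '()))
       (scandir dir (lambda (f) (string-suffix? ".md" f)))))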

The Implementation

The Entry Point

The entry point in Haunt is a site record, which can be created with a procedure that has the following docstring:

Create a new site object.  All arguments are optional:

TITLE: The name of the site
DOMAIN: The domain that will host the site
SCHEME: Either 'https' or 'http' ('https' by default)
POSTS-DIRECTORY: The directory where posts are found
FILE-FILTER: A predicate procedure that returns #f when a post file
should be ignored, and #t otherwise.  Emacs temp files are ignored by
default.
BUILD-DIRECTORY: The directory that generated pages are stored in
DEFAULT-METADATA: An alist of arbitrary default metadata for posts
whose keys are symbols
MAKE-SLUG: A procedure generating a file name slug from a post
READERS: A list of reader objects for processing posts
BUILDERS: A list of procedures for building pages from posts

The primary thing here is the list of builders. As previously mentioned, a builder is a special case of a complex transformation that does all the work of parsing, templating, generating collections, serialization, etc.

The rest of the list is basically metadata or auxiliary functions. While many of those values can be useful, almost none of them are needed in most cases: scheme and domain are used for RSS/Atom feeds, which are rare for personal or landing pages. Similar logic applies to the rest of the arguments, except maybe build-directory, which almost always makes sense.

Providing default values for them is convenient, but making them fields of the site record bakes in unnecessary assumptions about the nature of the blog and can negatively impact the rest of the implementation by adding unwanted coupling and reducing composability. One option to avoid this is to make them values in default-metadata rather than fields in the record.
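
For example (a sketch; author is a conventional post metadata key, while the domain entry is illustrative):

(use-modules (haunt site)
             (haunt builder blog))

(site #:title "Example"
      #:default-metadata '((author . "Alice")
                           (domain . "example.org"))
      #:builders (list (blog)))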

Builders, Themes and Readers

Builders are functions that accept a site and a list of posts, apply a series of transformations, and return a list of artifacts. Themes and readers are basically transformations used in the build process. Artifacts are records with an artifact-writer field containing a closure that writes the actual output file. A number of different builders are provided out of the box, but the most basic one (static-page) is missing; luckily, it's not hard to implement, so let's do it.

(define* (page-theme #:key (footer %default-footer))
  (theme
   #:layout
   (lambda (site title body)
     `((doctype "html")
       (head
        (meta (@ (charset "utf-8")))
        (title ,(string-append title " — " (site-title site))))
       (body
        (div (@ (class "container"))
             ,body
             ,footer))))
   #:post-template
   (lambda (post)
     `((div ,(post-sxml post))))))

(define* (static-page file destination
                      #:key
                      (theme (page-theme))
                      (reader commonmark-reader))
  "Return a builder procedure that reads FILE into SXML, adjusts it
according to the THEME and serialize to HTML and put it to
build-directory of the site.  DESTINATION is a relative resulting file
path."
  (lambda (site posts)
    (list
     (serialized-artifact
      destination
      (render-post theme site (read-post reader file '()))
      sxml->html))))
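
Assuming the definitions above are in scope, usage in haunt.scm could look like this:

(site #:title "My Site"
      #:builders (list (static-page "about.md" "about.html")))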

As described in the section about transformations, a series of transformations happens here:

  • read-post parses the markdown file and returns SXML + metadata.
  • render-post uses post-template from the theme to produce the SXML post body.
  • render-post then uses layout from the theme to produce the SXML of the complete page.
  • serialized-artifact creates a closure that wraps sxml->html and will later serialize the obtained page SXML to HTML.

The implementation using already existing APIs is quite easy, but unfortunately not perfect. While the functions and records are composable enough to produce the desired results, the names are quite confusing and tightly tied to blogs; they don't make much sense in the context of other site types.

Every builder always accepts a list of posts, which were read and transformed into SXML ahead of time. This transformation is implicit and, again, blog-related, which makes the implementation less generic. It could instead be implemented inside the blog builder, but then other builders like atom-feed wouldn't be able to reuse the already-read posts from the blog builder and would need to read them again. This is because the build process has three primary steps and looks like this:

;; 1. Prepare site and posts

;; 2. Build artifacts
(builder1 site posts) ;; => artifacts-1
(builder2 site posts) ;; => artifacts-2
(builder3 site posts) ;; => artifacts-3

;; 3. Produce actual site:
(serialize-artifacts
 (append artifacts-1 artifacts-2 artifacts-3))

This makes the build process rigid and procedures harder to compose. An alternative, more streamlined process could look like this:

(define readers (list ...))
;; threading macro passes the result of the form
;; as a first argument to the next form
(->
 (make-site ...) ;;=> ((site . <site-record>))
 (read-posts
  "posts/" readers) ;;=> ((posts . <list-of-posts>) (site . <site-record>))
 (static-page "index.md" "index.html") ;;=> ((artifacts <index-artifact>) ...)
 (blog-posts theme) ;;=> ((artifacts <post1-artifact> <index-artifact>) ...)
 (collection "main") ;;=> ((artifacts <coll-artifact> <post1-artifact> ...) ...)
 (atom) ;; takes value from posts and appends a few more artifacts
 (atom-by-tags)
 (serialize-artifacts!))

It is just a series of transformations enriching one associative data structure. Moreover, it makes the implementation of such transformations much more composable:

(define (read-posts o dir readers)
  (let* (;; (dir (site-posts-dir (assoc-ref o 'site))) ; could be
         (posts (map (read-with-readers readers) (files-in dir))))
    (alist-update o 'posts (lambda (x) (append x posts)))))

(define* (static-page o file destination
                      #:key
                      (reader commonmark-reader)
                      (page-layout default-page-layout))
  (let* ((sxml-body (get-sxml (reader file)))
         (sxml-page (page-layout sxml-body))
         (page (serialized-artifact destination sxml-page sxml->html)))
    (alist-update o 'artifacts (lambda (x) (append x (list page))))))

(define* (blog-posts o destination-dir
                     #:key
                     (page-layout default-page-layout)
                     (post-layout default-post-layout))
  "To better demonstrate the idea of reusability, this implementation
handles only the first post."
  (let* ((post (first (assoc-ref o 'posts)))
         (destination (string-append destination-dir (post-file post)))
         (sxml-content (get-sxml post))
         (sxml-body (post-layout sxml-content))
         (sxml-page (page-layout sxml-body))
         (page (serialized-artifact destination sxml-page sxml->html)))
    (alist-update o 'artifacts (lambda (x) (append x (list page))))))

(define* (collection o name
                     #:key
                     (filter-function identity)
                     (collection-generator default-collection-generator)
                     (page-layout default-page-layout))
  (let* ((posts (filter-function (assoc-ref o 'posts)))
         (file (string-append name ".html"))
         (sxml-body (collection-generator posts))
         (sxml-page (page-layout sxml-body))
         (collection (serialized-artifact file sxml-page sxml->html)))
    (alist-update o 'artifacts (lambda (x) (append x (list collection))))))
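
The alist-update helper used throughout these sketches is not part of Guile; a minimal pure version could look like this:

(use-modules (srfi srfi-1))

;; apply F to the current value of KEY ('() when the key is absent),
;; returning a new alist with the updated entry
(define (alist-update alist key f)
  (let ((value (or (assoc-ref alist key) '())))
    (cons (cons key (f value))
          (alist-delete key alist))))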

This approach has several advantages:

  • The naming of intermediate transformations is much more suitable: there is no notion of a post in the static-page builder.
  • The transformations are more atomic and easier to reuse (page-layout and the like), and there is no need to combine them into records like theme.
  • It's easier to restructure complex transformations: for example, a collection can become a part of the blog builder or remain a separate step, as in the example above.
  • There is no need to special-case the read and serialize steps; the read step can skip posts flagged as drafts or implement other advanced logic.
  • It becomes possible to build a page that relies on the content of previous steps, for example a collection of generated RSS/Atom links.

However, such an implementation has its own flaws: more flexibility and a less rigid structure can lead to more user mistakes and a steeper learning curve. Also, the original implementation could theoretically run builders in parallel, while here parallelism would have to be implemented on the user or builder side.

Readers

As a component of the build process, we encountered a step where a file written in a markup language is read by readers. There are two parts to it: reading the metadata and reading the actual content. Let's cover the implementation details for both.

Metadata

As shown in the example snippet in the transformation workflow section, one can provide additional metadata in a simple key-value format, separated from the content of the markup file by ---. There are two main issues with this implementation; let's discuss them.

The metadata section is required by the built-in readers: even if one doesn't want to set any values, they have to add --- at the beginning of the file. This requirement is unnecessary and could easily be avoided.

The metadata reader simply accepts colon-delimited key-value pairs, which is potentially not as flexible as YAML frontmatter. Metadata in such a format is usually not part of the markup grammar, which means the files are written in invalid markup. However, it's not a big deal, as readers can use custom metadata parsers.
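
A sketch of such a reader with an optional metadata block (names marked with * are hypothetical, while make-reader and make-file-extension-matcher come from (haunt reader)):

(use-modules (haunt reader)
             (commonmark)
             (ice-9 textual-ports))

;; naive header parser: assumes every non-empty line is "key: value"
(define (parse-metadata* text)
  (map (lambda (line)
         (let ((i (string-index line #\:)))
           (cons (string->symbol (substring line 0 i))
                 (string-trim (substring line (1+ i))))))
       (filter (negate string-null?) (string-split text #\newline))))

(define markdown-reader*
  (make-reader
   (make-file-extension-matcher "md")
   (lambda (file)
     (let* ((text (call-with-input-file file get-string-all))
            (sep (string-contains text "\n---\n")))
       (if sep
           ;; metadata block present: parse it, read the rest as markdown
           (values (parse-metadata* (substring text 0 sep))
                   (commonmark->sxml (substring text (+ sep 5))))
           ;; no metadata block: the whole file is content
           (values '() (commonmark->sxml text)))))))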

Guile-Commonmark and Tree-Sitter

Guile-Commonmark is used in Haunt by default to parse Markdown files into SXML. It doesn't support embedded HTML, tables, footnotes, or comments, so it can be quite inconvenient for many use cases. Still, it works and serves basic needs, and more advanced use cases could potentially be covered by more feature-complete libraries like a hypothetical guile-ts-markdown (a tree-sitter based Markdown parser).

Conclusion

Haunt is the primary player in the Scheme static site generator arena at the time of this writing. It provides all the basics needed to get up and running. The number of learning resources available in the wild is much smaller than for similar solutions from other language ecosystems, but the provided documentation and source code are enough for a seasoned Schemer to get started in just a matter of hours, which is hardly possible with projects like Hugo or Jekyll.

The functionality can be lacking in some cases, but due to the hackable nature of the project it is possible to gradually build upon the basics and cover future needs. Unfortunately, the current state of the Scheme ecosystem, and Guile's in particular, feels behind more mainstream languages, but hopefully the popularity of Guile will grow and the ecosystem with it in the near future.

Future Work

There are a number of improvement points for Haunt in particular and Guile Scheme in general. We need more complete tooling for working with markup languages like Org, Markdown, HTML, YAML, etc. As a generic solution, tree-sitter seems like a good candidate to quickly cover this huge area.

A more streamlined and composable build process for Haunt, as described in the Builders section, could add to Haunt's flexibility in general as well as encourage the use of reusable components.

Possible integrations with other tools like Guix, the REPL, and Emacs could bring easier deployment, better caching, more interactive development, and other goodies.

More documentation, materials, and tools for possible workflows and use cases would also help: from citation capabilities and automatic URL resolution to one-huge-file workflows and org-roam integration.

Acknowledgments. Kudos to David Thompson for making Haunt, to Erik Edrosa for making guile-commonmark, and to jgart for extensive and careful editing of the post.