Implementing a Static Site Generator - Part 1: Overview

c/c++

html

Markdown

Static Site Generator

2024-10-22

This will be a multi-part series that covers various implementation details of the static site generator used for making this very website.

In this first post I'll give an introduction to what the program does, my motivations and the general approach to its implementation. The particular details and the actual code will follow in future articles.

Some Background

Motivation

Besides a general desire to make everything myself, I also needed some sort of pipeline that would make it easy to add new content to the site. I'm not keen on doing any server-side scripting, and even less keen on using a blog engine, so a static site generator was the remaining alternative.

What is a Static Site Generator?

In simple terms, it is the opposite of a dynamic site generator, like WordPress or any social media platform, where the html that gets delivered to the user is generated on the server at the time of the request.

A static site generator pregenerates all the pages locally, which can then be uploaded and served as they are.

Main Features

An overview of its features can also be found in the showcase.

The generator has two main problems to solve.

Ease of writing/adding content.
Ease of modifying the site itself.

Markdown to Html

Writing content directly in html is impractical, so I needed an intermediate language that resembles plain text but still allows for defining elements such as headers, links and images.

I went with Markdown as the basis for my syntax. The original specification has some ambiguities which have been addressed by the CommonMark spec, so I used both as reference. I did not follow the spec completely. Some elements have multiple syntaxes of which I just picked one and I also extended the language to include custom elements and passing variables to the generator.

The Template System

During the first implementation of the Markdown converter I simply hardcoded the surrounding html into the program which meant that I had to recompile every time I wanted to modify the site itself.

This of course wouldn't do in the long term, so I needed a way to define the surrounding html externally.
Additionally, a page can consist of multiple pieces of content (like blog posts) that use the same enclosing html, so there had to be some way to specify html that gets instanced once per piece of content.
Furthermore, not all pages follow the same content structure, yet they still share elements (like the site header), so ideally all pages would be derived from the same source file.

To address these issues I came up with the "template system" which is now the core of the static site generator. The syntax and its exact use will be covered in a future post.

Implementation Overview

While the Markdown converter and template system perform completely different tasks they are fundamentally alike in what they do, which is to parse some input and produce an output, and so their processes can be summarized into the following three stages:

Lexing
Parsing
Output

Lexing

Lexing is the process of breaking down a text into its smallest meaningful components, represented by "tokens", which can then be parsed far more efficiently than raw characters.
The semantics would depend on the language but in my case I went with the most generic representation I could think of, the resulting types being:

Whitespace: A span of spaces and tabs.
Newline
Word: A span of letters.
Number: A span of decimal characters.
Each remaining usable ascii character is given its own type.
Unknown: A span of characters that do not match any other type.

Currently my "lexer" only considers characters in the ascii range, which is sufficient for syntax but may cause problems if content includes characters outside that range, which is not unlikely.
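As a rough illustration of the token types listed above, here is a minimal sketch of such a generic lexer in C++17. It is not the generator's actual code: the names `TokenKind`, `Token` and `lex` are made up for this example, and a single `Punct` kind stands in for giving every remaining ascii character its own type.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical token kinds mirroring the list above. Punct stands in for
// "each remaining ascii character gets its own type": the character itself
// is kept in `text` rather than spending one enum value per character.
enum class TokenKind { Whitespace, Newline, Word, Number, Punct, Unknown };

struct Token {
    TokenKind kind;
    std::string_view text;  // view into the source buffer, nothing is copied
};

static bool is_ascii(unsigned char c) { return c < 128; }
static bool is_blank(unsigned char c) { return c == ' ' || c == '\t'; }
static bool is_alpha(unsigned char c) { return is_ascii(c) && std::isalpha(c) != 0; }
static bool is_digit(unsigned char c) { return c >= '0' && c <= '9'; }
static bool is_punct(unsigned char c) { return is_ascii(c) && std::ispunct(c) != 0; }
static bool is_known(unsigned char c) {
    return c == '\n' || is_blank(c) || is_alpha(c) || is_digit(c) || is_punct(c);
}

// Advances `i` for as long as `pred` accepts the current byte.
template <typename Pred>
static void skip_while(std::string_view src, std::size_t& i, Pred pred) {
    while (i < src.size() && pred(static_cast<unsigned char>(src[i]))) ++i;
}

std::vector<Token> lex(std::string_view src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const std::size_t start = i;
        const unsigned char c = static_cast<unsigned char>(src[i]);
        TokenKind kind;
        if (c == '\n')        { kind = TokenKind::Newline; ++i; }
        else if (is_punct(c)) { kind = TokenKind::Punct;   ++i; }
        else if (is_blank(c)) { kind = TokenKind::Whitespace; skip_while(src, i, is_blank); }
        else if (is_alpha(c)) { kind = TokenKind::Word;       skip_while(src, i, is_alpha); }
        else if (is_digit(c)) { kind = TokenKind::Number;     skip_while(src, i, is_digit); }
        else {
            // Control characters and bytes outside ascii (e.g. UTF-8) end up here.
            kind = TokenKind::Unknown;
            skip_while(src, i, [](unsigned char ch) { return !is_known(ch); });
        }
        tokens.push_back({kind, src.substr(start, i - start)});
    }
    return tokens;
}
```

With this sketch, `lex("ab 12\n#")` yields five tokens: a Word, a Whitespace, a Number, a Newline and a Punct. A two-byte UTF-8 character such as "é" comes out as a single Unknown token, which shows why content outside the ascii range is a problem for this approach.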

Every parser implemented for the generator uses the same generic lexer. Besides Markdown and template syntax, the Markdown converter also supports syntax highlighting in code blocks, so each supported language requires its own parser.
The c/c++ parser adds an additional lexing stage since it requires a different set of tokens to be manageable but still, lexing the generic tokens turns out to be more effective than starting over from plain text.
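To give an idea of what a second lexing stage on top of generic tokens could look like, here is a small sketch. Which tokens the real c/c++ stage produces is not covered here, so everything below is an assumption: the example only merges runs of words, numbers and underscores into identifiers, and the names `GenericToken`, `CToken` and `relex_c` are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Generic tokens as a first lexing pass might produce them (hypothetical layout).
enum class Generic { Whitespace, Newline, Word, Number, Punct, Unknown };
struct GenericToken { Generic kind; std::string text; };

// A coarser, language-specific token set; only what this example needs.
enum class CKind { Identifier, Number, Other };
struct CToken { CKind kind; std::string text; };

static bool is_underscore(const GenericToken& t) {
    return t.kind == Generic::Punct && t.text == "_";
}
static bool starts_identifier(const GenericToken& t) {
    return t.kind == Generic::Word || is_underscore(t);
}
static bool continues_identifier(const GenericToken& t) {
    return starts_identifier(t) || t.kind == Generic::Number;
}

// Second stage: works on whole generic tokens instead of single characters.
std::vector<CToken> relex_c(const std::vector<GenericToken>& in) {
    std::vector<CToken> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (starts_identifier(in[i])) {
            // Glue Word / '_' / Number runs together into one identifier.
            std::string text;
            while (i < in.size() && continues_identifier(in[i])) text += in[i++].text;
            out.push_back({CKind::Identifier, text});
        } else if (in[i].kind == Generic::Number) {
            out.push_back({CKind::Number, in[i].text});
            ++i;
        } else {
            out.push_back({CKind::Other, in[i].text});
            ++i;
        }
    }
    return out;
}
```

Under these assumptions, the generic tokens for `my_var2 = 10` (Word, Punct, Word, Number, Whitespace, Punct, Whitespace, Number) collapse into Identifier `my_var2`, three Other tokens and Number `10`, without the second stage ever looking at individual characters.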

Parsing

Parsing is the process where we derive the actual meaning from the text and generate a grammatical representation of the final output, which in the case of our two main actors is a tree-like structure.

In Markdown, "blocks" represent the main structure of the text, such as headers, paragraphs, lists and code blocks. Blocks can be nested and so need to be linked in a hierarchy.
Each leaf block then points to an array of spans, which are the elements that make up the content of the block, such as links, images and plain text.
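The block and span hierarchy described above could be represented roughly like this in C++17. This is only a sketch: the type names, the set of kinds and the use of `std::vector` for both children and spans are assumptions made for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical span and block kinds, taken from the examples in the text.
enum class SpanKind { Text, Link, Image };

struct Span {
    SpanKind kind;
    std::string text;  // visible text, or alt text for an image
    std::string url;   // empty for plain text
};

enum class BlockKind { Header, Paragraph, List, ListItem, CodeBlock };

struct Block {
    BlockKind kind;
    std::vector<Block> children;  // nested blocks; empty for a leaf
    std::vector<Span> spans;      // inline content; only used by leaves
};

// A block without children is treated as a leaf in this sketch.
bool is_leaf(const Block& b) { return b.children.empty(); }

// Counts the leaf blocks at or below `b`.
std::size_t count_leaves(const Block& b) {
    if (is_leaf(b)) return 1;
    std::size_t n = 0;
    for (const Block& child : b.children) n += count_leaves(child);
    return n;
}
```

A list with two items would then be one `List` block holding two `ListItem` leaves, each with its own array of spans.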

The template file can be divided into nested "scopes", much like in any c-like language. The resulting structure is a tree of tokens that represent the basic elements of the template, which are either text, scope or variable.

Instancing

In addition, the template system utilizes another tree-like structure that represents the input data. It's a hierarchy since scopes can have multiple instances but not all instances have the same set of variables.
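One possible shape for that input data, sketched in C++17: each instance of a scope carries its own variables and its own nested scopes. The `Instance` type and `find_variable` helper are hypothetical names for this example, not the generator's real data structures.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical layout for the template input data: one Instance per
// occurrence of a scope, each with its own variables and nested scopes.
struct Instance {
    std::map<std::string, std::string> variables;         // name -> value
    std::map<std::string, std::vector<Instance>> scopes;  // scope name -> its instances
};

// Returns the value of `name` in this instance, or nullptr if this
// particular instance does not define it.
const std::string* find_variable(const Instance& inst, const std::string& name) {
    const auto it = inst.variables.find(name);
    return it == inst.variables.end() ? nullptr : &it->second;
}
```

Because every instance owns its variable map, two instances of the same scope can define different sets of variables, and a lookup simply reports when a variable is missing.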

Output

The block tree is traversed depth-first, moving from one leaf to the next available one. Blocks map directly to html, so it's just a matter of appending a block's opening tag when it is entered, and its closing tag when traversal returns up the hierarchy.
When a leaf block is reached, it will iterate through all its spans and generate the appropriate html.
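The traversal can be sketched as a short recursive function. Note that recursion is used here for brevity and is not necessarily how the generator walks its tree. The tag mapping is an assumption too (header levels are ignored), and html escaping is left out entirely.

```cpp
#include <string>
#include <vector>

enum class SpanKind { Text, Link, Image };
struct Span { SpanKind kind; std::string text; std::string url; };

enum class BlockKind { Header, Paragraph, List, ListItem };
struct Block {
    BlockKind kind;
    std::vector<Block> children;  // empty for a leaf
    std::vector<Span> spans;      // only used by leaves
};

// Hypothetical mapping from block kind to html tag; header levels are ignored.
static const char* tag_for(BlockKind kind) {
    switch (kind) {
    case BlockKind::Header:    return "h1";
    case BlockKind::Paragraph: return "p";
    case BlockKind::List:      return "ul";
    case BlockKind::ListItem:  return "li";
    }
    return "div";  // not reached for the kinds above
}

// Writes the html for a single span (no escaping in this sketch).
static void emit_span(const Span& s, std::string& out) {
    switch (s.kind) {
    case SpanKind::Text:  out += s.text; break;
    case SpanKind::Link:  out += "<a href=\"" + s.url + "\">" + s.text + "</a>"; break;
    case SpanKind::Image: out += "<img src=\"" + s.url + "\" alt=\"" + s.text + "\">"; break;
    }
}

void emit_block(const Block& b, std::string& out) {
    const std::string tag = tag_for(b.kind);
    out += "<" + tag + ">";  // opening tag on the way down
    if (b.children.empty()) {
        for (const Span& s : b.spans) emit_span(s, out);  // leaf: write its spans
    } else {
        for (const Block& child : b.children) emit_block(child, out);
    }
    out += "</" + tag + ">";  // closing tag when returning up the hierarchy
}
```

For a list holding one item with the spans "See " and a link labeled "docs" pointing to /docs, this produces `<ul><li>See <a href="/docs">docs</a></li></ul>`.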

The template system does much the same kind of traversal of its token tree, with the exception that traversal will repeat for branches that refer to multiple instances.
Only leaf tokens point to output data, which is written directly.
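Here is a minimal sketch of that kind of traversal, assuming a token tree of text, variable and scope nodes and an input data tree of instances. The `Node`, `Instance` and `render` names are hypothetical, the template is built by hand since the syntax is a topic for a later post, and writing nothing for a missing variable is just one possible choice.

```cpp
#include <map>
#include <string>
#include <vector>

// The three basic template elements: text, variable or scope.
enum class NodeKind { Text, Variable, Scope };

struct Node {
    NodeKind kind;
    std::string value;           // Text: literal html; Variable/Scope: its name
    std::vector<Node> children;  // only used by Scope nodes
};

// Input data: every instance of a scope has its own variables and sub-scopes.
struct Instance {
    std::map<std::string, std::string> variables;
    std::map<std::string, std::vector<Instance>> scopes;
};

void render(const std::vector<Node>& nodes, const Instance& data, std::string& out) {
    for (const Node& n : nodes) {
        switch (n.kind) {
        case NodeKind::Text:
            out += n.value;  // leaf: written directly
            break;
        case NodeKind::Variable: {
            // Leaf: look the value up in the current instance.
            const auto it = data.variables.find(n.value);
            if (it != data.variables.end()) out += it->second;
            break;
        }
        case NodeKind::Scope: {
            // Branch: repeat the traversal once per instance of this scope.
            const auto it = data.scopes.find(n.value);
            if (it == data.scopes.end()) break;
            for (const Instance& inst : it->second) render(n.children, inst, out);
            break;
        }
        }
    }
}
```

Given a scope named `post` that wraps a `title` variable in `<li>` tags, and input data with two `post` instances titled "A" and "B", the scope body is rendered twice and the output contains `<li>A</li><li>B</li>`.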

Closing Notes

Project Scope

I severely underestimated the size of the project, mainly because I was unaware of all the features I would require and of how difficult parsing Markdown turned out to be.
Still, every addition was well worth the time and overall I'm quite pleased with the result.

Future Work

Most of the parsers are still rudimentary and not feature complete, so they will require further development as issues arise.
The generic lexer will probably need to support UTF-8.
I can imagine future scenarios that would require dynamically generated elements, like comments or selective loading of content, but at the moment those are only speculative.

In general I expect the development of this tool to continue for as long as I'm using the site. There's always something that can be added or improved upon. Indeed, there's one major feature still in the works.

That's it for the overview. In the next part we'll have a look at the Markdown converter.

For questions and comments, please send an email or get in touch on LinkedIn.
