blahgd fmt 3 Limitations
Recently, I described the four post formats (fmt 0–3) that my blogging software ever supported. This is the promissed follow up.
I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. fmt 4 will still be LaTeX-like, but it will address the following issues present in fmt 3.
HTML rendering is context in-sensitive: It turns out that there are many instances where blindly converting a character to whatever is necessary to render it properly in HTML is the wrong thing to do. For example, there are many Wikipedia articles that contain an apostrophe in the name. In a recent link dump, I mentioned Fubini’s theorem. The apostrophe used in the URL must be an ASCII code 0x27 and not a Unicode 0x2019 (aka. ’). If I forget to do this and type:
\wiki{Fubini's theorem}
The link text will look nice (thanks to ’), but the link URL will be broken because there is no Wikipedia article called “Fubini’s Theorem”. To work around this, I end up using:
\wiki[Fubini's theorem]{Fubini\%27s theorem}
This will use ’ for the link text and %27 in the link URL. It’s ugly and not user friendly.
The “wiki” command isn’t the only one where special behavior makes sense.
Sadly, the fmt 3 parser just isn’t suited to allow for this. It is the yacc grammar that converts single and double quotes (among other characters) to the appropriate HTML escapes. Eventually, this converted string gets fed to the function that turns the one or two arguments into a link to Wikipedia. At this point, the original raw text is lost. However that is exactly what is needed to link to the proper place!
Metadata is duplicated: Another issue is that it would be nice to keep the metadata of the post with the post text itself. LaTeX-like markup lends itself very nicely to this. For example:
\title{Post Formats in blahgd} \tag{reply} \tag{blahg} \published{2015-10-05 19:04}
Unfortunately, it’d take some unpleasant hacks to stash these values (unmangled!) in the structure keeping track of the post while a request is processed. So, even for fmt 3, I have to keep metadata in a separate metadata file. (The thoughts and design behind the metadata file could easily fill a post of its own.)
At first, I did not mind this extra copy of the metadata. However, over time it became obvious that this duplication leads to the age-old consistentcy problem. It is tempting to solve this problem by restricting which copy can be modified. Given that being able to edit everything with a text editor is a key goal of blahgd, restricting what can be editted is the wrong solution. Instead, eliminating all but one copy of the metadata is the proper way to solve this.
Future Plans
To solve these limitations with fmt 3, I am planning for the next format’s parser to do less. Instead of containing the entire translation code in the lex and yacc files, I will have the parser produce a parse tree. Then, the code will transform the parse tree into an AST which will then be transformed into whatever output is necessary (e.g., HTML, Atom, RSS). This will take care of all of the above mentioned issues since the rendering pass will have access at the original text (or equivalent of it). Yes, this sounds a bit heavy-handed but I really think it is the way to go. (For what it’s worth, I am not the only one who thinks that converting markup to HTML should go through an AST.)
Obviously, one of the goals with fmt 4 is to keep as close to fmt 3 as is practical. This will allow a quick and easy migration of the over 400 posts currently using it.