Josef “Jeff” Sipek

blahgd fmt 3 Limitations

Recently, I described the four post formats (fmt 0–3) that my blogging software ever supported. This is the promissed follow up.

I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. fmt 4 will still be LaTeX-like, but it will address the following issues present in fmt 3.

HTML rendering is context in-sensitive: It turns out that there are many instances where blindly converting a character to whatever is necessary to render it properly in HTML is the wrong thing to do. For example, there are many Wikipedia articles that contain an apostrophe in the name. In a recent link dump, I mentioned Wikipedia article: Fubini’s theorem. The apostrophe used in the URL must be an ASCII code 0x27 and not a Unicode 0x2019 (aka. ’). If I forget to do this and type:

\wiki{Fubini's theorem}

The link text will look nice (thanks to ’), but the link URL will be broken because there is no Wikipedia article called “Fubini’s Theorem”. To work around this, I end up using:

\wiki[Fubini's theorem]{Fubini\%27s theorem}

This will use ’ for the link text and %27 in the link URL. It’s ugly and not user friendly.

The “wiki” command isn’t the only one where special behavior makes sense.

Sadly, the fmt 3 parser just isn’t suited to allow for this. It is the yacc grammar that converts single and double quotes (among other characters) to the appropriate HTML escapes. Eventually, this converted string gets fed to the function that turns the one or two arguments into a link to Wikipedia. At this point, the original raw text is lost. However that is exactly what is needed to link to the proper place!

Metadata is duplicated: Another issue is that it would be nice to keep the metadata of the post with the post text itself. LaTeX-like markup lends itself very nicely to this. For example:

\title{Post Formats in blahgd}
\published{2015-10-05 19:04}

Unfortunately, it’d take some unpleasant hacks to stash these values (unmangled!) in the structure keeping track of the post while a request is processed. So, even for fmt 3, I have to keep metadata in a separate metadata file. (The thoughts and design behind the metadata file could easily fill a post of its own.)

At first, I did not mind this extra copy of the metadata. However, over time it became obvious that this duplication leads to the age-old consistentcy problem. It is tempting to solve this problem by restricting which copy can be modified. Given that being able to edit everything with a text editor is a key goal of blahgd, restricting what can be editted is the wrong solution. Instead, eliminating all but one copy of the metadata is the proper way to solve this.

Future Plans

To solve these limitations with fmt 3, I am planning for the next format’s parser to do less. Instead of containing the entire translation code in the lex and yacc files, I will have the parser produce a parse tree. Then, the code will transform the parse tree into an AST which will then be transformed into whatever output is necessary (e.g., HTML, Atom, RSS). This will take care of all of the above mentioned issues since the rendering pass will have access at the original text (or equivalent of it). Yes, this sounds a bit heavy-handed but I really think it is the way to go. (For what it’s worth, I am not the only one who thinks that converting markup to HTML should go through an AST.)

Obviously, one of the goals with fmt 4 is to keep as close to fmt 3 as is practical. This will allow a quick and easy migration of the over 400 posts currently using it.

Post Formats in blahgd

After reading What creates a good wikitext dialect depends on how it’s going to be used, I decided to write a short post about how I handled the changing needs that my blahg experienced.

One of the things that I implemented in my blogging software a long time ago was the support for different flavors of markup. This allowed me to switch to a “saner” markup without revisiting all the previous entries. Since every post already has metadata (title, publication time, etc.) it was easy enough to add another field (“fmt”) which in my case is just a simple integer identifying how the post contents should be parsed.

Over the years, there have been four formats:

fmt 0 (removed)
Wordpress compat
fmt 1
“Improved” Wordpress format
fmt 2
raw html
fmt 3
LaTeX-like (my current format of choice)

The formats follow a pretty natural progression.

It all started in January 2009 as an import of about 250 posts from Wordpress. Wordpress stores the posts in a html-esque format. At the very least, it inserts line and paragraph breaks. If I remember correctly, one newline is a line break and two newlines are a paragraph break, but otherwise is passes along HTML more or less unchanged. Suffice to say, I was not in the mood to rewrite all the posts in a different format right away so I implemented something that resembled the Wordpress behavior. I did eventually end up converting these posts to a more modern format and then I removed support for this one.

The next format (fmt 1) was a natural evolution of fmt 0. One thing that drove me nuts about fmt 0 was the handling of line breaks. Since the whole point of blahgd was to allow me to compose entries in a text editor (vim if you must know) and I like my text files to be word wrapped, the transformation of every newline to <br/> was quite annoying. (It would result in jagged lines in the browser!) So, about a month later (February 2009), I made a trivial change to the parsing code to treat a single newline as just whitespace. I called this changed up parser fmt 1. (There are currently 24 posts using this format.)

A couple of months after I added fmt 1, I came to the conclusion that in some cases I just wanted to write raw HTML. And so fmt 2 was born (April 2009). (There are currently 5 posts using this format.)

After using fmt 2 for about a year and a half, I concluded that writing raw HTML is not the way to go. Any time I wanted to change the rendering of a certain thing (e.g., switching between <strong> and <b>), I had to revisit every post. Since I am a big fan of LaTeX, I thought it would be cool to have a LaTeX-like markup. It took a couple of false starts spanning about six months but eventually (February 2011) I got the lex and yacc files working well enough. (There are currently 422 posts using this format.)

While I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. I am hoping to make a separate post about the fmt 3 limitations and how fmt 4 will address them sometime in the (hopefully) near future.

Powered by blahgd