Josef “Jeff” Sipek

CBOR vs. JSON vs. libnvpair

My blahg uses nvlists for logging extra information about its operation. Historically, it used Sun libnvpair. That is, it used its data structures as well as the XDR encoding to serialize the data to disk.

A few months ago, I decided to replace libnvpair with my own nvlist implementation—one that was more flexible and better integrated with my code. (It is still a bit of a work-in-progress, but it is looking good.) The code conversion went smoothly, and since then all the new information was logged in JSON.

Last night, I decided to convert a bunch of the previously accumulated libnvpair data files into the new JSON-based format. After whipping up a quick conversion program, I ran it on the data. The result surprised me—the JSON version was about 55% of the size of the libnvpair encoded input!

This piqued my interest. I re-ran the conversion but with CBOR (RFC 7049) as the output format. The result was even better with the output being 45% of libnvpair’s encoding.

This made me realize just how inefficient libnvpair is when serialized. At least part of it is because XDR (the way libnvpair serializes data) uses a lot of padding, while both JSON and CBOR use a more compact encoding for many data types (e.g., an unsigned number in CBOR uses 1 byte for the type and 0, 1, 2, 4, or 8 additional bytes based on its magnitude, while libnvpair always encodes a uint64_t as 8 bytes plus 4 bytes for the type).
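
To make the size difference concrete, here is a minimal sketch (not blahgd's actual code; the function name and buffer handling are mine) of how RFC 7049 serializes an unsigned integer as major type 0:

#include <stdint.h>
#include <stddef.h>

/*
 * Illustrative sketch of RFC 7049 major type 0 (unsigned integer)
 * encoding.  Major type 0 has the three high bits of the initial byte
 * set to 0, so the initial byte is just the "additional info" value.
 * Returns the number of bytes written; buf must hold at least 9 bytes.
 */
static size_t cbor_encode_uint(uint64_t v, uint8_t *buf)
{
	if (v <= 23) {
		buf[0] = (uint8_t)v;		/* value fits in the initial byte */
		return 1;
	} else if (v <= UINT8_MAX) {
		buf[0] = 24;			/* 1-byte argument follows */
		buf[1] = (uint8_t)v;
		return 2;
	} else if (v <= UINT16_MAX) {
		buf[0] = 25;			/* 2-byte argument follows */
		buf[1] = (uint8_t)(v >> 8);
		buf[2] = (uint8_t)v;
		return 3;
	} else if (v <= UINT32_MAX) {
		buf[0] = 26;			/* 4-byte argument follows */
		buf[1] = (uint8_t)(v >> 24);
		buf[2] = (uint8_t)(v >> 16);
		buf[3] = (uint8_t)(v >> 8);
		buf[4] = (uint8_t)v;
		return 5;
	} else {
		buf[0] = 27;			/* 8-byte argument follows */
		for (int i = 0; i < 8; i++)
			buf[1 + i] = (uint8_t)(v >> (56 - 8 * i));
		return 9;
	}
}

So a small value like 17 serializes as a single byte, and even a full 64-bit value takes only 9 bytes, versus the constant 12 bytes (8 for the value plus 4 for the type) that libnvpair's XDR encoding spends per the numbers above.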

Since CBOR is 79% of JSON’s size (and significantly less underspecified than the minefield that is JSON), I am hoping to convert everything that makes sense to CBOR. (CBOR being a binary format makes it harder for people to hand-edit it. If hand-editing is desirable, then it makes sense to stick with JSON or other text-based formats.)

The Data & Playing with Compression

The blahg-generated dataset that I converted consisted of 230866 files, each containing an nvlist. The following byte counts are for a simple concatenation of the files. (A more complicated format like tar would add a significant enough overhead to make the encoding efficiency comparison flawed.)

Format    Size      % of nvpair
nvpair    471 MB    100%
JSON      257 MB    54.6%
CBOR      203 MB    45.1%

I also took each of the concatenated files and compressed it with gzip, bzip2, and xz. In each case, I used the most aggressive compression by using -9. The percentages in parentheses are comparing the compressed size to the same format’s uncompressed size. The results:

Format    Uncompressed    gzip             bzip2            xz
nvpair    471 MB          37.4 MB (7.9%)   21.0 MB (4.5%)   15.8 MB (3.3%)
JSON      257 MB          28.7 MB (11.1%)  17.9 MB (7.0%)   14.5 MB (5.6%)
CBOR      203 MB          26.8 MB (13.2%)  16.9 MB (8.3%)   13.7 MB (6.7%)

(The compression ratios are likely artificially better than normal since each of the 230k files has the same nvlist keys.)

Since tables like this are hard to digest, I turned the same data into a graph:

CBOR does very well uncompressed. Even after compressing it with a general purpose compression algorithm, it outperforms JSON with the same algorithm by about 5%.

I look forward to using CBOR everywhere I can.

blahgd fmt 3 Limitations

Recently, I described the four post formats (fmt 0–3) that my blogging software has ever supported. This is the promised follow-up.

I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. fmt 4 will still be LaTeX-like, but it will address the following issues present in fmt 3.

HTML rendering is context-insensitive: It turns out that there are many instances where blindly converting a character to whatever is necessary to render it properly in HTML is the wrong thing to do. For example, there are many Wikipedia articles that contain an apostrophe in the name. In a recent link dump, I mentioned Wikipedia article: Fubini’s theorem. The apostrophe used in the URL must be ASCII code 0x27 and not Unicode 0x2019 (a.k.a. ’). If I forget to do this and type:

\wiki{Fubini's theorem}

The link text will look nice (thanks to ’), but the link URL will be broken because there is no Wikipedia article called “Fubini’s Theorem”. To work around this, I end up using:

\wiki[Fubini's theorem]{Fubini\%27s theorem}

This will use ’ for the link text and %27 in the link URL. It’s ugly and not user friendly.

The “wiki” command isn’t the only one where special behavior makes sense.

Sadly, the fmt 3 parser just isn’t suited to allow for this. It is the yacc grammar that converts single and double quotes (among other characters) to the appropriate HTML escapes. Eventually, this converted string gets fed to the function that turns the one or two arguments into a link to Wikipedia. At this point, the original raw text is lost. However, that is exactly what is needed to link to the proper place!
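
To put it differently, the command that builds the link needs to apply two different transformations to the same raw argument. Here is a rough, self-contained sketch of what I mean; the helper names are made up, and the escaping rules are trimmed down to just the characters that matter for this example (this is not fmt 3's actual code):

#include <stdio.h>

/* URL-style escaping for the href: ' becomes %27, spaces become _ */
static void url_escape(const char *raw, FILE *out)
{
	for (; *raw; raw++) {
		if (*raw == '\'')
			fputs("%27", out);
		else if (*raw == ' ')
			fputc('_', out);
		else
			fputc(*raw, out);
	}
}

/* HTML-style escaping for the link text: ' becomes &rsquo; */
static void html_escape(const char *raw, FILE *out)
{
	for (; *raw; raw++) {
		if (*raw == '\'')
			fputs("&rsquo;", out);
		else
			fputc(*raw, out);
	}
}

/* both transformations need the original, unescaped argument */
static void render_wiki_link(const char *raw, FILE *out)
{
	fputs("<a href=\"https://en.wikipedia.org/wiki/", out);
	url_escape(raw, out);
	fputs("\">Wikipedia article: ", out);
	html_escape(raw, out);
	fputs("</a>", out);
}

int main(void)
{
	render_wiki_link("Fubini's theorem", stdout);
	fputc('\n', stdout);
	return 0;
}

By the time fmt 3's \wiki handler runs, only the HTML-escaped string is left, so the URL half of this has nothing correct to work with.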

Metadata is duplicated: Another issue is that it would be nice to keep the metadata of the post with the post text itself. LaTeX-like markup lends itself very nicely to this. For example:

\title{Post Formats in blahgd}
\tag{reply}
\tag{blahg}
\published{2015-10-05 19:04}

Unfortunately, it’d take some unpleasant hacks to stash these values (unmangled!) in the structure keeping track of the post while a request is processed. So, even for fmt 3, I have to keep metadata in a separate metadata file. (The thoughts and design behind the metadata file could easily fill a post of its own.)

At first, I did not mind this extra copy of the metadata. However, over time it became obvious that this duplication leads to the age-old consistency problem. It is tempting to solve this problem by restricting which copy can be modified. Given that being able to edit everything with a text editor is a key goal of blahgd, restricting what can be edited is the wrong solution. Instead, eliminating all but one copy of the metadata is the proper way to solve this.

Future Plans

To solve these limitations with fmt 3, I am planning for the next format’s parser to do less. Instead of containing the entire translation code in the lex and yacc files, I will have the parser produce a parse tree. Then, the code will transform the parse tree into an AST, which will then be transformed into whatever output is necessary (e.g., HTML, Atom, RSS). This will take care of all of the above-mentioned issues since the rendering pass will have access to the original text (or an equivalent of it). Yes, this sounds a bit heavy-handed, but I really think it is the way to go. (For what it’s worth, I am not the only one who thinks that converting markup to HTML should go through an AST.)
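
To give a flavor of what "the parser produces a tree" could look like, here is a hypothetical sketch; none of these types or names exist in blahgd today, they just illustrate the shape I have in mind (nodes keep the raw text, and each output format supplies its own rendering callbacks):

#include <stddef.h>

/* Hypothetical sketch only -- nothing like this exists in blahgd yet. */
enum ast_type {
	AST_TEXT,		/* raw text, kept unescaped until render time */
	AST_PARAGRAPH,		/* a paragraph containing child nodes */
	AST_WIKI_LINK,		/* \wiki{...} with the raw argument preserved */
};

struct ast_node {
	enum ast_type type;
	const char *raw;		/* original text, if any */
	struct ast_node **children;
	size_t nchildren;
};

/* each output format (HTML, Atom, RSS, ...) supplies its own callbacks */
struct renderer {
	void (*text)(const char *raw);
	void (*wiki_link)(const char *raw_title);
	void (*para_begin)(void);
	void (*para_end)(void);
};

static void render(const struct ast_node *node, const struct renderer *r)
{
	size_t i;

	switch (node->type) {
	case AST_TEXT:
		r->text(node->raw);		/* escaping happens here, per format */
		break;
	case AST_WIKI_LINK:
		r->wiki_link(node->raw);	/* raw title available for both URL and text */
		break;
	case AST_PARAGRAPH:
		r->para_begin();
		for (i = 0; i < node->nchildren; i++)
			render(node->children[i], r);
		r->para_end();
		break;
	}
}

The point of this shape is that escaping happens only inside the per-format callbacks, so the raw text survives all the way to the rendering pass.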

Obviously, one of the goals with fmt 4 is to keep as close to fmt 3 as is practical. This will allow a quick and easy migration of the over 400 posts currently using it.

Post Formats in blahgd

After reading What creates a good wikitext dialect depends on how it’s going to be used, I decided to write a short post about how I handled the changing needs that my blahg experienced.

One of the things that I implemented in my blogging software a long time ago was the support for different flavors of markup. This allowed me to switch to a “saner” markup without revisiting all the previous entries. Since every post already has metadata (title, publication time, etc.) it was easy enough to add another field (“fmt”) which in my case is just a simple integer identifying how the post contents should be parsed.

Over the years, there have been four formats:

fmt 0 (removed)
    Wordpress compat
fmt 1
    “Improved” Wordpress format
fmt 2
    raw HTML
fmt 3
    LaTeX-like (my current format of choice)

The formats follow a pretty natural progression.

It all started in January 2009 as an import of about 250 posts from Wordpress. Wordpress stores the posts in an HTML-esque format. At the very least, it inserts line and paragraph breaks. If I remember correctly, one newline is a line break and two newlines are a paragraph break, but otherwise it passes along HTML more or less unchanged. Suffice it to say, I was not in the mood to rewrite all the posts in a different format right away, so I implemented something that resembled the Wordpress behavior. I did eventually end up converting these posts to a more modern format, and then I removed support for this one.

The next format (fmt 1) was a natural evolution of fmt 0. One thing that drove me nuts about fmt 0 was the handling of line breaks. Since the whole point of blahgd was to allow me to compose entries in a text editor (vim, if you must know) and I like my text files to be word wrapped, the transformation of every newline to <br/> was quite annoying. (It would result in jagged lines in the browser!) So, about a month later (February 2009), I made a trivial change to the parsing code to treat a single newline as just whitespace. I called this changed-up parser fmt 1. (There are currently 24 posts using this format.)
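
The difference between the two behaviors is tiny. A simplified sketch (not the actual fmt 0/fmt 1 code, and ignoring everything except newline handling) looks roughly like this:

#include <stdio.h>

/*
 * fmt 0: a lone newline becomes <br/>; fmt 1: a lone newline becomes a
 * plain space.  In both, a blank line (two newlines) starts a new
 * paragraph.  Real paragraph wrapping is omitted for brevity.
 */
static void emit_breaks(const char *in, int fmt, FILE *out)
{
	for (; *in; in++) {
		if (*in != '\n') {
			fputc(*in, out);
		} else if (in[1] == '\n') {
			fputs("</p>\n<p>", out);	/* paragraph break */
			in++;
		} else if (fmt == 0) {
			fputs("<br/>\n", out);		/* fmt 0: line break */
		} else {
			fputc(' ', out);		/* fmt 1: just whitespace */
		}
	}
}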

A couple of months after I added fmt 1, I came to the conclusion that in some cases I just wanted to write raw HTML. And so fmt 2 was born (April 2009). (There are currently 5 posts using this format.)

After using fmt 2 for about a year and a half, I concluded that writing raw HTML is not the way to go. Any time I wanted to change the rendering of a certain thing (e.g., switching between <strong> and <b>), I had to revisit every post. Since I am a big fan of LaTeX, I thought it would be cool to have a LaTeX-like markup. It took a couple of false starts spanning about six months, but eventually (February 2011) I got the lex and yacc files working well enough. (There are currently 422 posts using this format.)

I am reasonably happy with fmt 3, but I do see some limitations that I’d like to address. This will inevitably lead to fmt 4. I am hoping to make a separate post about the fmt 3 limitations and how fmt 4 will address them sometime in the (hopefully) near future.

Moving and Downtime

I’ll be moving my server over the next couple of days. I’m working on an email setup to make sure there’s no interruption there. The website and the blahg will, however, be down until Wednesday evening. Sorry for any inconvenience this may cause.

Comment Spam Filtering Experiments

Just a heads up, I’m getting fed up with all the comment spam that ends up in the moderation queue. So, I’m working on some code to reject comment spam before it ever hits the queue. As the title of this post implies, these are experiments; I’ll try my best not to reject any valid comments. I apologize if a valid comment does get rejected.

If you end up being a victim of my overzealous filters, please email me: jeffpc@josefsipek.net.

Post Preview

One of the blogs I’ve been reading for a few months now just had a post about partial vs. full entries on blog front pages. Since I have some opinions on the subject, I decided to comment. My response turned into something sufficiently content-full that I decided that my blahg would be a better place for it. Sorry, Chris :P

First of all, my blog doesn’t support partial post display because… technical reasons. (The sinking feeling of discovering a design mistake in your code really resonated with me about this exact thing.) With that said, I don’t think that partial display is necessarily bad. I feel like any reasonable (this is of course subjective) blogging software should follow these rules (sketched in code right after the list):

  1. if we’re displaying an Atom/RSS feed, display full post
  2. if we’re displaying a single post, display full post
  3. if the post contains a magical marker that denotes where to stop the preview, display everything above the marker
  4. display full post
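
Here is that sketch: a hypothetical decision function (the names are mine, not any particular blogging software's API) that captures the four rules above:

#include <stdbool.h>

enum display { DISPLAY_FULL, DISPLAY_PREVIEW };

/* rules 1-4 from the list above, in order */
static enum display choose_display(bool is_feed, bool is_single_post,
                                   bool has_preview_marker)
{
	if (is_feed)
		return DISPLAY_FULL;		/* 1: feeds always get the whole post */
	if (is_single_post)
		return DISPLAY_FULL;		/* 2: so do single-post pages */
	if (has_preview_marker)
		return DISPLAY_PREVIEW;		/* 3: honor the author's marker */
	return DISPLAY_FULL;			/* 4: otherwise, the whole post */
}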

I really dislike when the feeds give me the first sentence and I have to click a link to read more. At the very least, it is inconvenient, and in extreme cases it feels outright insulting.

I think the post-by-post-basis Chris suggests is the way to go, but in the absence of a user-defined division point I would display the whole thing.

Do I write many posts where I wish I could use this magical marker? No. If that were the case, I’d make supporting this a higher priority. However, there have been a handful of times where I believe that the rest of the post is uninteresting to…well…just about everyone and it is really long. So long, that you might get bored trying to scroll past it. (If you are reading my blahg, I don’t want you to be bored because you had to scroll for too long to skip over an entry — you are my guest, and I am here to entertain you.) This is the time I believe displaying a partial post is good.

I’m hoping that eventually I’ll wrestle with my blogging software sufficiently to eliminate the technical reasons preventing me from introducing and processing this special marker. Not that you’ll really notice anything different. :)

Comment Spam, Part Deux

This is a continuation of Comment Spam from almost 6 months ago. Recently, I noticed that a lot of the comment spam I get appears to contain MD5 hashes. Here’s an example, with all the spam parts smudged.

[image: MD5 comment spam]

I wonder why the spammers decided to include a hash in the comment. It could be an interesting way to eliminate duplication of spam — but I’m not sure that it would help them all that much.

And just if you are curious, the IP address is in China’s range.

Comment Spam

So, I’ve been getting a lot of comment spam lately — you can’t see it because I moderate it all. It is not a fun activity, so I looked at the available anti-comment-spam plugins for Wordpress. I have only one thing to say: they all suck. I looked at at least 7 different ones, which were neat in different ways, but they all suck. They force you to modify some “magic” PHP files. Sure, I understand the code, but what I don’t understand is: why? I’m no Wordpress expert, but I am sure that there is a way to hook into it somewhere. This would easily allow me to upgrade Wordpress at will without the fear of having to re-“apply” these hacky “plugins.” Shame. If I didn’t have an aversion to PHP, I might have written a good one myself. I hope that the plugin authors or the Wordpress folks see this and fix it.
