Introducing Post Statistics

Having all my blog posts as Markdown files lets me write scripts to parse them. Parsing the YAML headers is easy, so I can quickly get the date, language and category of each post. And once I have data, I can make plots.
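As a rough illustration of what such a script might look like, here is a minimal sketch that pulls the header out of a post using only the Python standard library. It assumes a header delimited by `---` lines with flat `key: value` entries. The field names `date`, `language` and `category`, the helper `parse_header` and the example post are made up for this sketch, and a real script would more likely hand the header to a proper YAML library.

```python
from datetime import date


def parse_header(text: str) -> dict[str, str]:
    """Return the flat key/value pairs from a '---' delimited header.

    Only handles simple 'key: value' lines, not full YAML.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    header = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of the header block
            break
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            header[key.strip()] = value.strip()
    return header


# Hypothetical post, not taken from the actual blog.
example_post = """---
title: Example post
date: 2022-01-15
language: en
category: Computer
---

Body of the post.
"""

meta = parse_header(example_post)
year = date.fromisoformat(meta["date"]).year
print(year, meta["language"], meta["category"])
```

Reducing the date to the year right away gives exactly the three variables used in the rest of this post.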

There are three fundamental variables that I have for each post:

  1. Date
  2. Category
  3. Language

The date is in principle a continuous variable, but taking just the year makes it an ordinal variable. Category and language are nominal variables. The type of each variable, and how many distinct values it has, limits which encodings are sensible: we have many years, five categories and two languages. The following table summarizes the options.

Variable    Encodings
Date        X
Year        X, Column
Category    X, Color
Language    X, Color, Column

On the Y-axis I will always use the count. Then we can form the following simple plots with one variable on the X-axis and another as color.
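Getting the counts for such a plot is just a matter of grouping by two of the variables. A small sketch with `collections.Counter`, assuming each post has already been reduced to a `(year, category, language)` tuple. The sample rows are invented for illustration and are not my real post data.

```python
from collections import Counter

# Invented sample rows: (year, category, language).
posts = [
    (2015, "Computer", "en"),
    (2015, "Science", "en"),
    (2020, "Computer", "en"),
    (2020, "Traffic", "de"),
    (2021, "Traffic", "de"),
    (2021, "Traffic", "de"),
]

# X-axis variable first, color variable second; the count goes on the Y-axis.
counts = Counter((year, category) for year, category, _language in posts)

for (year, category), n in sorted(counts.items()):
    print(year, category, n)
```

Picking a different pair of fields for the key gives the counts for any of the other two-variable plots.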

First we take a look at the posts over the years, with the category annotated with color:

One can clearly see how I switched to Nikola in 2020 and started to write many more posts than in the years before, when I had a hierarchical site built with Sphinx. One can also see that 2022 has only just started and that I haven't published many posts yet. Interestingly, there is a cluster of posts in 2015 and 2016; I guess I wrote a lot of articles back then during my Master's courses. One can also see how the focus has shifted from mainly computer stuff to much more traffic policy stuff.

The next one is essentially the same, just with language encoded as color now:

Here we can see how I have had phases where most posts were in English, and now we are in a phase where most posts are in German. I try to choose the language based on the perceived target audience, but that doesn't always work out. With tools like the DeepL Chrome Extension the posts should be accessible to most visitors, though.

Without the year

Letting go of the year variable for now, and looking just at category and language, we can create the following plot:

We can see how the categories are distributed within each language. German articles are mostly about traffic, whereas English articles are mostly about computer stuff.

And then one can turn it around as well:

We can see how computer stuff is mostly in English, but not exclusively. Science is the same, mostly in English. Traffic, however, is almost exclusively in German. This makes sense to me because I am writing about municipal traffic policy in and around my city, and I think that the target audience is people from Germany or even just from my city.
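Both views rest on the same counts, just grouped in the other direction. If one prefers shares over raw counts, the idea can be sketched as follows. The sample rows are again invented, and `shares` is a hypothetical helper written for this example.

```python
from collections import Counter, defaultdict

# Invented sample rows: (category, language).
posts = [
    ("Computer", "en"),
    ("Computer", "en"),
    ("Computer", "de"),
    ("Science", "en"),
    ("Traffic", "de"),
    ("Traffic", "de"),
]


def shares(pairs):
    """For pairs (group, item), return the fraction of each item per group."""
    counts = Counter(pairs)
    totals = defaultdict(int)
    for (group, _item), n in counts.items():
        totals[group] += n
    return {(group, item): n / totals[group] for (group, item), n in counts.items()}


# Categories within each language.
by_language = shares((language, category) for category, language in posts)
# Languages within each category.
by_category = shares(posts)

print(by_language[("de", "Traffic")])
print(by_category[("Traffic", "de")])
```

The two numbers answer different questions: how much of the German writing is about traffic, versus how much of the traffic writing is in German.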

With columns

Finally we can add the column encoding as well and encode all three variables at the same time. We first take the language as columns.

There is not much to see here that we haven't seen before, I'd think. Regrouping this shows the shift of languages over time, with German becoming the focus.

And then one can make one row per category to see how each category has evolved over time.
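The X, color and column encodings used in this section map directly onto what grammar-of-graphics tools offer. As a sketch, and not necessarily how the plots on this page are produced, a three-variable plot could be described as a Vega-Lite specification built from a plain Python dictionary. The field names `year`, `category` and `language` and the sample rows are assumptions for illustration.

```python
import json

# Invented sample rows, one per post.
rows = [
    {"year": 2016, "category": "Computer", "language": "en"},
    {"year": 2020, "category": "Traffic", "language": "de"},
    {"year": 2021, "category": "Traffic", "language": "de"},
]

spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {"values": rows},
    "mark": "bar",
    "encoding": {
        # Year is treated as ordinal, category and language as nominal.
        "x": {"field": "year", "type": "ordinal"},
        # The Y-axis is always the number of posts.
        "y": {"aggregate": "count", "type": "quantitative"},
        "color": {"field": "category", "type": "nominal"},
        "column": {"field": "language", "type": "nominal"},
    },
}

print(json.dumps(spec, indent=2))
```

Swapping the fields of `color` and `column` gives the regrouped view, and using the `row` channel instead of `column` gives one row per category.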

Conclusion

These plots are always generated from the latest state of the posts and are now linked in the navigation.

I have yet to decide which plots are most useful. Maybe I will only present a few of them on the permanent page going forward.