Primitive Obsession

On the Refactoring Guru website, there is one anti-pattern called “primitive obsession”, which is about the obsession with primitive types. In Python I would define “primitive types” as “jsonable types”, the ones that can be represented in JSON. This is a pattern that I have observed a couple of times, and I want to show a little example from my own code.

I have a bunch of Python scripts to handle my blog posts. And naturally I have a class which represents a blog post:

@dataclasses.dataclass()
class Post:
    meta: Dict[str, Any]
    text: str
    path: str
    auxiliary_files: List[str]
    tags: List[str]

This data class would be called struct in C and C++, POD in Java, list in R. Such data structures seem rather nice. They can easily be serialized into JSON (just use json.dumps(post.__dict__)) or YAML. And they can easily be restored (use Post(**json.loads(s)). And if new fields are added, they are automatically taken care of. Also one can easily print the state of the program everywhere with a print, and one can just export JSON from the debugger, if needed. What's not to love about this?

There are three issues that I have with that: (1) One is forced to use primitive types all the way down; (2) there is no encapsulation; and (3) there is no static type checking possible.

For instance the path attribute would better be a pathlib.Path object instead of a primitive string. But that is not jsonable, and needs to either have a special JSON decoder and encoder, or we write a function which imports and exports Post objects into primitives. This would decouple serialization from the internal representation of the core entity that this thing is.

Separating core data from its serialization also gives me the opportunity to version the serialization. This way I can add (optional) fields to the core entity and still serialize it into the old format (losing a bit of data), and load the old format. Especially with APIs that is a big advantage.

Regarding the encapsulation now. If one is content with C-style programming, there is not so much wrong about that. In C there is no encapsulation anyway. Everything is just pointers and structs, so such an approach is completely normal. One has a bunch of scattered free functions which are weakly associated via their name; think of fopen, fread, fseek, fprintf and fclose. The FILE structs are manipulated by these functions, and invariants are only guaranteed if one doesn't manipulate the FILE objects manually.

The third issue is about static type checking. I can make sure that Post.text is always a string instance. But what about Post.meta? It is a dict which can contain whatever I care to write into the YAML header of my blog posts. Surely this YAML header is part of the post, but I have to know what is inside that in other functions. For instance I have expressions like the following scattered around my scripts to check whether a post is a draft:

post.meta.get("status", "") == "draft"

I have to use dict.get() here because not only blog post has the status key in the YAML header. And then I check against the string "draft", which is a primitive type.

What I would really prefer in terms of encapsulation and rich types is either of the following. A method which just tells me whether it is a draft. All the details are encapsulated there:

post.is_draft()

And if I want to expose the status, I would prefer an enumeration class for the statuses and another data class for the meta data such that I can at least do this:

post.meta.status == PostStatus.draft

This would be type safe, I couldn't have a typo in the string "draft" and I could even search for usages of PostStatus.draft or PostMeta.status to figure out where they are used.

With the nested data classes and rich types, one cannot simply serialize the magic dict (post.__dict__) as JSON. But I can write an export function if I need that.

Refactoring

In order to get rid of these problems, I turn Post into an actual class with enumerations:

class PostState(enum.Enum):
    published = 1
    featured = 2
    draft = 3
    private = 4


class BoardState(enum.Enum):
    in_progress = "In Bearbeitung"
    waiting = "Warten"
    ready = "Bereit"


class PostLanguage(enum.Enum):
    german = "Deutsch"
    english = "Englisch"
    undefined = ""


class Post:
    def __init__(self, path: pathlib.Path):
        self.path: pathlib.Path = path

        fm = frontmatter.load(path)
        self._meta = fm.metadata
        self.text: str = fm.content

        self.title: str = self._meta["title"]
        self.date: datetime.date = self._meta["date"]
        self.updated: Optional[datetime.date] = self._meta.get("updated", None)
        self.category: str = self._meta["category"]
        self.tags: List[str] = self._meta.get("tags", "").split(", ")
        self.status = PostState[self._meta.get("status", "published")]
        self.preview_image: Optional[str] = self._meta.get("previewimage", None)
        self.language: PostLanguage = find_language(self.tags)
        self.latlon: Optional[str] = self._meta.get('latlon', None)

        self.board: Optional[BoardState] = None
        if "board" in self._meta:
            for x in BoardState:
                if x.value == self._meta["board"]:
                    self.board = x
        elif self.is_draft:
            self.board = BoardState.in_progress

        self.auxiliary_files: List[pathlib.Path] = list(path.parent.glob("*.*"))

    def write(self) -> None:
        self._meta["tags"] = ", ".join(self.tags)
        self._meta["category"] = self.category
        fm_post = frontmatter.Post(self.text, **self._meta)
        with open(self.path, "wb") as f:
            frontmatter.dump(fm_post, f)

    @property
    def is_draft(self):
        return self.status == PostState.draft

The loading is now encapsulated. The writing is encapsulated. I have a nice property to check whether the post is a draft. And it allows to turn various hideous lines into pretty ones. Takes for instance this one, which determines a string representation for the language:

"English" if "English" in post.meta["tags"] else "German"

Now I can just take the enum value:

post.language.value

So this class might be more complicated than the data class before. But all other code is just so much cleaner and better to work with, that I really like how it turned out after removing the primitives.