Primitive Obsession
On the Refactoring Guru website, there is one anti-pattern called “primitive obsession”, which is about the obsession with primitive types. In Python I would define “primitive types” as “jsonable types”, the ones that can be represented in JSON. This is a pattern that I have observed a couple of times, and I want to show a little example from my own code.
I have a bunch of Python scripts to handle my blog posts. And naturally I have a class which represents a blog post:
@dataclasses.dataclass() class Post: meta: Dict[str, Any] text: str path: str auxiliary_files: List[str] tags: List[str]
This data class would be called struct in C and C++, POD in Java, list in R. Such data structures seem rather nice. They can easily be serialized into JSON (just use json.dumps(post.__dict__)
) or YAML. And they can easily be restored (use Post(**json.loads(s))
. And if new fields are added, they are automatically taken care of. Also one can easily print the state of the program everywhere with a print
, and one can just export JSON from the debugger, if needed. What's not to love about this?
There are three issues that I have with that: (1) One is forced to use primitive types all the way down; (2) there is no encapsulation; and (3) there is no static type checking possible.
For instance the path
attribute would better be a pathlib.Path
object instead of a primitive string. But that is not jsonable, and needs to either have a special JSON decoder and encoder, or we write a function which imports and exports Post
objects into primitives. This would decouple serialization from the internal representation of the core entity that this thing is.
Separating core data from its serialization also gives me the opportunity to version the serialization. This way I can add (optional) fields to the core entity and still serialize it into the old format (losing a bit of data), and load the old format. Especially with APIs that is a big advantage.
Regarding the encapsulation now. If one is content with C-style programming, there is not so much wrong about that. In C there is no encapsulation anyway. Everything is just pointers and structs, so such an approach is completely normal. One has a bunch of scattered free functions which are weakly associated via their name; think of fopen
, fread
, fseek
, fprintf
and fclose
. The FILE
structs are manipulated by these functions, and invariants are only guaranteed if one doesn't manipulate the FILE
objects manually.
The third issue is about static type checking. I can make sure that Post.text
is always a string instance. But what about Post.meta
? It is a dict which can contain whatever I care to write into the YAML header of my blog posts. Surely this YAML header is part of the post, but I have to know what is inside that in other functions. For instance I have expressions like the following scattered around my scripts to check whether a post is a draft:
post.meta.get("status", "") == "draft"
I have to use dict.get()
here because not only blog post has the status
key in the YAML header. And then I check against the string "draft"
, which is a primitive type.
What I would really prefer in terms of encapsulation and rich types is either of the following. A method which just tells me whether it is a draft. All the details are encapsulated there:
post.is_draft()
And if I want to expose the status, I would prefer an enumeration class for the statuses and another data class for the meta data such that I can at least do this:
post.meta.status == PostStatus.draft
This would be type safe, I couldn't have a typo in the string "draft"
and I could even search for usages of PostStatus.draft
or PostMeta.status
to figure out where they are used.
With the nested data classes and rich types, one cannot simply serialize the magic dict (post.__dict__
) as JSON. But I can write an export function if I need that.
Refactoring
In order to get rid of these problems, I turn Post
into an actual class with enumerations:
class PostState(enum.Enum): published = 1 featured = 2 draft = 3 private = 4 class BoardState(enum.Enum): in_progress = "In Bearbeitung" waiting = "Warten" ready = "Bereit" class PostLanguage(enum.Enum): german = "Deutsch" english = "Englisch" undefined = "" class Post: def __init__(self, path: pathlib.Path): self.path: pathlib.Path = path fm = frontmatter.load(path) self._meta = fm.metadata self.text: str = fm.content self.title: str = self._meta["title"] self.date: datetime.date = self._meta["date"] self.updated: Optional[datetime.date] = self._meta.get("updated", None) self.category: str = self._meta["category"] self.tags: List[str] = self._meta.get("tags", "").split(", ") self.status = PostState[self._meta.get("status", "published")] self.preview_image: Optional[str] = self._meta.get("previewimage", None) self.language: PostLanguage = find_language(self.tags) self.latlon: Optional[str] = self._meta.get('latlon', None) self.board: Optional[BoardState] = None if "board" in self._meta: for x in BoardState: if x.value == self._meta["board"]: self.board = x elif self.is_draft: self.board = BoardState.in_progress self.auxiliary_files: List[pathlib.Path] = list(path.parent.glob("*.*")) def write(self) -> None: self._meta["tags"] = ", ".join(self.tags) self._meta["category"] = self.category fm_post = frontmatter.Post(self.text, **self._meta) with open(self.path, "wb") as f: frontmatter.dump(fm_post, f) @property def is_draft(self): return self.status == PostState.draft
The loading is now encapsulated. The writing is encapsulated. I have a nice property to check whether the post is a draft. And it allows to turn various hideous lines into pretty ones. Takes for instance this one, which determines a string representation for the language:
"English" if "English" in post.meta["tags"] else "German"
Now I can just take the enum value:
post.language.value
So this class might be more complicated than the data class before. But all other code is just so much cleaner and better to work with, that I really like how it turned out after removing the primitives.