The Case for Standard Configuration File Formats
The numerical software that I work with comes in the form of command line tools. These programs get their parameters from command line arguments and/or parameter files. Their output is written to files as well. Since one can redirect standard output into a text file, we will only consider files here.
These files have some sort of format. Usually the output generated with print statements is rather ad hoc and only of an informative nature; it is hard to read with a program. We will have a look at the input parameters first.
Command Line Arguments
Positional Arguments
The worst thing from a usability point of view is a program that only has positional arguments. This can be fine for simple programs like cp, where it is rather clear what happens. But let's say we want to run an HMC simulation with a program like my su2-hmc. It needs a lot of parameters:
- Temporal lattice extent
- Spatial lattice extent
- Molecular dynamics time step
- Coupling β
- Number of molecular dynamics steps
- Number of updates to perform
- Skipping of storing configurations
- Initial seed
- Initial standard deviation
One could just do this:
$ su2-hmc 8 8 0.01 1 100 180 0 0 0.0
The problem is that the order of these arguments is crucial, and since they are all numbers, they are easily interchanged. The program then still runs, but it does not do what you wanted.
Also, arguments cannot be optional; users have to supply all of them.
Named Arguments
Some improvement is achieved by using named arguments. Then a call to the program would look like this:
$ su2-hmc --time 8 --space 8 --time-step 0.01 --beta 1 --steps 100 \
    --total 180 --skip 0 --seed 0 --hot-start-std 0.0
This is more readable, but it now spans two lines, and for a more complex setup the command will get significantly longer. On the other hand, arguments can have default values, so one does not need to supply all of them, and there is no danger of interchanging values.
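Independent of how su2-hmc actually parses its arguments, a hypothetical sketch of such named arguments with defaults, written with Python's argparse and mirroring the values from the call above, might look like this:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--time', type=int, default=8)
parser.add_argument('--space', type=int, default=8)
parser.add_argument('--time-step', type=float, default=0.01)
parser.add_argument('--beta', type=float, default=1.0)
parser.add_argument('--steps', type=int, default=100)
parser.add_argument('--total', type=int, default=180)
parser.add_argument('--skip', type=int, default=0)
parser.add_argument('--seed', type=int, default=0)
parser.add_argument('--hot-start-std', type=float, default=0.0)

options = parser.parse_args()
# Dashes in the option names become underscores in the attribute names.
print(options.time_step, options.hot_start_std)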
Remembering the Invocation
A few weeks or months after the simulation ran, somebody will ask you which parameters you used. This is a completely reasonable question, but it might be rather hard to answer. If the program just took the options on the command line, they are not written down anywhere. They might be in the shell history, but with Bash that can be rather fragile. Your co-workers might be able to see the files on a shared file system, but they cannot look into your shell history; and even if they could, they still would not know the directory where the command was executed.
One way to prevent this is to write every command to a file in the same directory. This way one can easily check which parameters have been used. The program can be changed such that it writes all of its command line arguments to a file.
I have done this, but in retrospect it does not make much sense. Instead of writing a command line parser in the program and then also writing out the options, it is much easier to write a short shell script that then invokes the program. This way you still have an audit trail, but no need to change your program, and the same approach works with any program.
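As a hypothetical illustration, the same bookkeeping can also be done from a small Python wrapper that records the exact command line before running the program (the file name invocations.log and the argument list are made up):
import shlex
import subprocess

command = ['su2-hmc', '--time', '8', '--space', '8', '--time-step', '0.01']

# Append the full command line to a log file in the working directory.
with open('invocations.log', 'a') as log:
    log.write(shlex.join(command) + '\n')

# Then run the actual program.
subprocess.run(command, check=True)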
Configuration Files
The shell script is an improvement, but it is not really a configuration file: the options have no hierarchy and cannot carry comments. Lots of configuration files on Linux already ship with every option set to its default value, accompanied by a comment. This way the user can learn about the options and override values easily. Additionally, configuration files provide an audit trail.
I'd even go so far as to say that the program should only take the path to the configuration file and nothing more. Exceptions are options which really have no impact on reproducibility.
Custom Configuration File
Unfortunately, a custom configuration file format is quickly invented. One can start with something like this:
length-time = 8
length-space = 8
time-step = 0.01
Writing a parser for this is not hard either. One just splits each line at =, trims off the surrounding whitespace and converts the value from a string to the desired type. A simple Python implementation would look like the following:
with open(config_file) as file:
    options = {}
    for line in file:
        parts = line.split('=')
        key = parts[0].strip()
        value = eval(parts[1])
        options[key] = value
It will work, but passing untrusted input to eval is a serious security risk. If somebody gave you a configuration file with malicious code in it, that code would be executed under your account. But that could be changed in the implementation.
Another problem is that this does not allow for comments in the file. A user might expect that lines starting with # will be ignored. With this implementation they are only tolerated by accident: a comment that happens to contain an = just adds a key to the dictionary that is never looked up, while a comment (or blank line) without an = makes the parser fail. An = sign in the value part is a problem as well; everything after the second = is silently dropped.
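A sketch of a slightly more careful variant: it skips blank lines and # comments, splits only at the first =, and replaces eval with ast.literal_eval, which only accepts Python literals (numbers, quoted strings, lists, …) instead of executing arbitrary code:
import ast

options = {}
with open(config_file) as file:
    for line in file:
        line = line.strip()
        # Skip blank lines and comments instead of choking on them.
        if not line or line.startswith('#'):
            continue
        # Split only at the first =, so the value may contain further = signs.
        key, _, value = line.partition('=')
        options[key.strip()] = ast.literal_eval(value.strip())
Even then the values have to be valid Python literals, and the format remains home-grown.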
There are other examples of custom formats that have their issues: one program parses the file exactly once and expects the options to appear in a particular order. If one gets this order wrong, the options are not read correctly.
At some point one might notice that the format lacks features. Usually a hot-fix is then bolted onto the parser without considering whether the format is still sensible.
The format is only used within that one program, so it is effectively intertwined with it. Therefore no other program can read the file. If the simulation package is written in C++, one cannot read its parameter file with an analysis program in R or Python.
Standard Format
Using a standardized format has many advantages:
- You don't need to write your own parser. That has already been done for many programming languages and can easily be used. Such a parser is probably also faster and a bit more robust than a self-written one.
- The formats have various extra features like comments, hierarchy, native representation of data types. These might not seem to be required now, but it is good if the format has some complexity in reserve in case it is needed.
- Text editors also know about these formats and can provide syntax highlighting and indentation.
- The configuration file can be parsed with different programs. The interpretation of the values of course differs, but an analysis program can simply read the simulation parameters without them having to be communicated separately.
INI
A very simple format is INI. It was used extensively on Windows, which nowadays uses the registry instead; on Linux one can still find a lot of programs that use it. The simulation could now have a configuration file looking like this:
[lattice]
length_time = 8
length_space = 8

[md]
time_step = 0.01
beta = 1
steps = 100
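Reading such a file from Python needs no hand-written parser; a minimal sketch with the standard-library configparser module could look like this (the file name example.ini is an assumption):
import configparser

config = configparser.ConfigParser()
config.read('example.ini')

# All values are stored as strings, so numbers need an explicit conversion.
length_time = config.getint('lattice', 'length_time')
time_step = config.getfloat('md', 'time_step')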
The advantages of the format:
- Easy to read and write
- One level of hierarchy
- Comments
Disadvantages:
- Only exactly one level of hierarchy
- Not nestable
- All values are just strings
JSON
A format that is a bit more complex is JSON. It can represent lists and dictionaries of strings, numbers, and boolean values. This makes it suitable for data exchange. The configuration file would look like this:
{ "lattice": { "length_time": 8, "length_space": 8 }, "md": { "time_step": 0.01, "beta": 1, "steps": 100 } }
Advantages:
- Nestable
- Aware of data types
- Extremely widespread
Disadvantages:
- Cumbersome to read and write
- No comments possible
YAML
YAML is my favorite because it is human-readable and offers a lot of flexibility. The configuration would look like this:
lattice:
  length_time: 8
  length_space: 8

md:
  time_step: 0.01
  beta: 1
  steps: 100
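A minimal sketch of reading this with the PyYAML library (the file name config.yml is an assumption):
import yaml

with open('config.yml') as file:
    options = yaml.safe_load(file)

# Integers and floats are recognized automatically.
print(options['lattice']['length_time'], options['md']['time_step'])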
Advantages:
- Nestable
- Aware of data types
- Comments possible
- Easy to read, reasonable to write (watch the whitespace)
- Allows for object references and can serialize custom data types as well
Disadvantages:
- There are libraries for C, C++, Python, and R (and many other languages), but it seems just a little less of a de-facto standard than JSON.
XML
XML is as classic as JSON; there are parsers for every language, and most languages even have multiple parsing libraries.
<lattice>
    <length_time>8</length_time>
    <length_space>8</length_space>
</lattice>
<md>
    <time_step>0.01</time_step>
    <beta>1</beta>
    <steps>100</steps>
</md>
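Strictly speaking, a well-formed XML file also needs a single root element wrapping these two blocks. Assuming that, and a file name of config.xml, a minimal reading sketch with the standard library's xml.etree.ElementTree could look like this:
import xml.etree.ElementTree as ET

tree = ET.parse('config.xml')
root = tree.getroot()

# Element text is always a string, so numbers need an explicit conversion.
length_time = int(root.findtext('lattice/length_time'))
time_step = float(root.findtext('md/time_step'))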
Advantages:
- Nestable
- Comments possible
- Widespread, great tooling available
- Offers a lot of complexity; one can also add attributes to elements.
- Has DTDs which let you define valid tags for your configuration file
Disadvantages:
- Really verbose, hard to read, horrible to write by hand
- Not aware of data types directly
The Chroma software uses XML for its input files and also for its output. Once you have a few XML tools, like a parser with XPath, working with XML becomes rather nice. Also, it is nestable, such that the full input file can be included in the output inside some tag. Therefore you can just carry along all previous input and still parse everything as one large tree.
For most projects, XML is overkill. For Chroma, YAML would probably be slightly better, but there is no point in changing that after the fact.
Conclusion
For every new project that is some sort of simulation software, I would use YAML for as many things as possible. Parameters go into a YAML configuration file which is read by the simulation program. There won't be print statements in the code; instead there will be structured output, also in YAML format. This way it is very easy to parse the output with a program if one desires to do so. Any data that is written out is structured in YAML as well.
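As a hypothetical illustration of such structured output, the program could dump its results with PyYAML instead of using print (the keys and values here are made up):
import yaml

results = {
    'acceptance_rate': 0.82,             # made-up numbers for illustration
    'plaquette': [0.591, 0.588, 0.590],
}

with open('output.yml', 'w') as file:
    yaml.safe_dump(results, file)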
Only large data sets (more than can reasonably fit into a text file, perhaps a thousand values or so) would be stored in some binary format. An apparently good choice is HDF5: it is standardized and has tooling and libraries around it. It also supports enough complexity, such as hierarchy and metadata, that it can be used to store complicated data sets.
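A minimal sketch of writing such a data set with the h5py library (the file name, data set name and attribute are made up):
import h5py
import numpy as np

data = np.random.rand(1000, 8, 8)  # stand-in for some large array

with h5py.File('data.h5', 'w') as file:
    dataset = file.create_dataset('configurations', data=data)
    dataset.attrs['beta'] = 1.0    # metadata attached to the data set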