The Case for Standard Configuration File Formats
The numerical software that I work with comes in the form of command line tools. These programs get their parameters from command line arguments and/or parameter files. Their output is written to files as well. Since one can redirect standard output into a text file, we will only consider files here.
These files have some sort of format. Usually the output generated with print statements is rather ad hoc and only of an informative nature; it is hard to read with a program. We will have a look at the input parameters first.
Command Line Arguments
Positional Arguments
The worst thing from a usability point of view is a program that only has positional arguments. This can be fine for simple programs like cp, where it is rather clear what happens. But let's say we want to run an HMC simulation with a program like my su2-hmc. It needs a lot of parameters:
- Temporal lattice extent
- Spatial lattice extent
- Molecular dynamics time step
- Coupling β
- Number of molecular dynamics steps
- Number of updates to perform
- Skipping of storing configurations
- Initial seed
- Initial standard deviation
One could just do this:
$ su2-hmc 8 8 0.01 1 100 180 0 0 0.0
The problem is that the order of these arguments is crucial, and since they are all numbers, they are easily interchanged. The program then still runs, but it does not do what you wanted.
Also, arguments cannot be optional; users have to supply all of them.
Named Arguments
Some improvement is achieved by using named arguments. Then a call to the program would look like this:
$ su2-hmc --time 8 --space 8 --time-step 0.01 --beta 1 --steps 100 \
    --total 180 --skip 0 --seed 0 --hot-start-std 0.0
This is more readable, but it now spans two lines, and for a more complex setup the command will get significantly longer. On the other hand, arguments can have default values, so one does not need to supply all of them, and there is no danger of interchanging values.
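Independent of how su2-hmc actually parses its arguments, a hypothetical sketch of such named arguments with defaults, written with Python's argparse and mirroring the values from the call above, might look like this:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--time', type=int, default=8)
parser.add_argument('--space', type=int, default=8)
parser.add_argument('--time-step', type=float, default=0.01)
parser.add_argument('--beta', type=float, default=1.0)
parser.add_argument('--steps', type=int, default=100)
parser.add_argument('--total', type=int, default=180)
parser.add_argument('--skip', type=int, default=0)
parser.add_argument('--seed', type=int, default=0)
parser.add_argument('--hot-start-std', type=float, default=0.0)

options = parser.parse_args()
# Dashes in the option names become underscores in the attribute names.
print(options.time_step, options.hot_start_std)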
Remembering the Invocation
A few weeks or months after the simulation ran, somebody will ask you which parameters you used. This is a completely reasonable question, but it might be rather hard to answer. If the program just took the options on the command line, they are not written down anywhere. They might be in the shell history, but with Bash that can be rather fragile. Your co-workers might be able to see the files on a shared file system, but they cannot look into your shell history; and even if they could, they still would not know the directory where the command was executed.
One way to prevent this is to write every command to a file in the same directory. This way one can easily check which parameters have been used. The program can be changed such that it writes all of its command line arguments to a file.
I have done this, but in retrospect it does not make much sense. Instead of writing a command line parser in the program and then also writing out the options, it is much easier to write a short shell script that then invokes the program. This way you still have an audit trail, but no need to change your program, and the same approach works with any program.
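As a hypothetical illustration, the same bookkeeping can also be done from a small Python wrapper that records the exact command line before running the program (the file name invocations.log and the argument list are made up):
import shlex
import subprocess

command = ['su2-hmc', '--time', '8', '--space', '8', '--time-step', '0.01']

# Append the full command line to a log file in the working directory.
with open('invocations.log', 'a') as log:
    log.write(shlex.join(command) + '\n')

# Then run the actual program.
subprocess.run(command, check=True)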
Configuration Files
The shell script is an improvement, but it is not really a configuration file: the options have no hierarchy and cannot carry comments. Lots of configuration files on Linux already ship with every option set to its default value, accompanied by a comment. This way the user can learn about the options and override values easily. Additionally, configuration files provide an audit trail.
I'd even go so far as to say that the program should only take the path to the configuration file and nothing more. Exceptions are options which really have no impact on reproducibility.
Custom Configuration File
Unfortunately, a custom configuration file format is quickly invented. One can start with something like this:
length-time = 8
length-space = 8
time-step = 0.01
Writing a parser for this is not hard either. One just splits each line at =, trims off the surrounding whitespace and converts the value from a string to the desired type. A simple Python implementation would look like the following:
with open(config_file) as file:
    options = {}
    for line in file:
        parts = line.split('=')
        key = parts[0].strip()
        value = eval(parts[1])
        options[key] = value
It will work, but passing untrusted input to eval is a serious security risk. If somebody gave you a configuration file with malicious code in it, that code would be executed under your account. But that could be changed in the implementation.
Another problem is that this does not allow for comments in the file. A user might expect that lines starting with # will be ignored. With this implementation they are only tolerated by accident: a comment that happens to contain an = just adds a key to the dictionary that is never looked up, while a comment (or blank line) without an = makes the parser fail. An = sign in the value part is a problem as well; everything after the second = is silently dropped.
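A sketch of a slightly more careful variant: it skips blank lines and # comments, splits only at the first =, and replaces eval with ast.literal_eval, which only accepts Python literals (numbers, quoted strings, lists, …) instead of executing arbitrary code:
import ast

options = {}
with open(config_file) as file:
    for line in file:
        line = line.strip()
        # Skip blank lines and comments instead of choking on them.
        if not line or line.startswith('#'):
            continue
        # Split only at the first =, so the value may contain further = signs.
        key, _, value = line.partition('=')
        options[key.strip()] = ast.literal_eval(value.strip())
Even then the values have to be valid Python literals, and the format remains home-grown.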
There are other examples of custom formats that have their issues: one program parses the file exactly once and expects the options to appear in a particular order. If one gets this order wrong, the options are not read correctly.
At some point one might notice that the format lacks features. Usually a hot-fix is then bolted onto the parser without considering whether the format is still sensible.
The format is only used within that one program, so it is effectively intertwined with it. Therefore no other program can read the file. If the simulation package is written in C++, one cannot read its parameter file with an analysis program in R or Python.
Standard Format
Using a standardized format has many advantages:
- You don't need to write your own parser. That has already been done for many programming languages and can easily be used. Such a parser is probably also faster and a bit more robust than a self-written one.
- The formats have various extra features like comments, hierarchy, native representation of data types. These might not seem to be required now, but it is good if the format has some complexity in reserve in case it is needed.
- Text editors also know about these formats and can provide syntax highlighting and indentation.
- The configuration file can be parsed with different programs. The interpretation of the values of course differs, but an analysis program can simply read the simulation parameters without them having to be communicated separately.
INI
A very simple format is INI. It was used extensively on Windows, which nowadays uses the registry instead; on Linux one can still find a lot of programs that use it. The simulation could now have a configuration file looking like this:
[lattice]
length_time = 8
length_space = 8

[md]
time_step = 0.01
beta = 1
steps = 100
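Reading such a file from Python needs no hand-written parser; a minimal sketch with the standard-library configparser module could look like this (the file name example.ini is an assumption):
import configparser

config = configparser.ConfigParser()
config.read('example.ini')

# All values are stored as strings, so numbers need an explicit conversion.
length_time = config.getint('lattice', 'length_time')
time_step = config.getfloat('md', 'time_step')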
The advantages of the format:
- Easy to read and write
- One level of hierarchy
- Comments
Disadvantages:
- Only exactly one level of hierarchy
- Not nestable
- All values are just strings
JSON
A format that is a bit more complex is JSON. It can represent lists and dictionaries of strings, numbers, and boolean values. This makes it suitable for data exchange. The configuration file would look like this:
{ "lattice": { "length_time": 8, "length_space": 8 }, "md": { "time_step": 0.01, "beta": 1, "steps": 100 } }
Advantages:
- Nestable
- Aware of data types
- Extremely widespread
Disadvantages:
- Cumbersome to read and write
- No comments possible
YAML
YAML is my favorite because it is human-readable and offers a lot of flexibility. The configuration would look like this:
lattice:
  length_time: 8
  length_space: 8

md:
  time_step: 0.01
  beta: 1
  steps: 100
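A minimal sketch of reading this with the PyYAML library (the file name config.yml is an assumption):
import yaml

with open('config.yml') as file:
    options = yaml.safe_load(file)

# Integers and floats are recognized automatically.
print(options['lattice']['length_time'], options['md']['time_step'])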
Advantages:
- Nestable
- Aware of data types
- Comments possible
- Easy to read, reasonable to write (watch the whitespace)
- Allows for object references and can serialize custom data types as well
Disadvantages:
- There are libraries for C, C++, Python, and R (and many other languages), but it seems just a little less of a de-facto standard than JSON.
XML
XML is as classic as JSON; there are parsers for every language, and most languages even have multiple parsing libraries.
<lattice>
    <length_time>8</length_time>
    <length_space>8</length_space>
</lattice>
<md>
    <time_step>0.01</time_step>
    <beta>1</beta>
    <steps>100</steps>
</md>
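Strictly speaking, a well-formed XML file also needs a single root element wrapping these two blocks. Assuming that, and a file name of config.xml, a minimal reading sketch with the standard library's xml.etree.ElementTree could look like this:
import xml.etree.ElementTree as ET

tree = ET.parse('config.xml')
root = tree.getroot()

# Element text is always a string, so numbers need an explicit conversion.
length_time = int(root.findtext('lattice/length_time'))
time_step = float(root.findtext('md/time_step'))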
Advantages:
- Nestable
- Comments possible
- Widespread, great tooling available
- Offers a lot of complexity; one can also add attributes to elements.
- Has DTDs which let you define valid tags for your configuration file
Disadvantages:
- Really verbose, hard to read, horrible to write by hand
- Not aware of data types directly
The Chroma software uses XML for its input files and also for its output. Once you have a few XML tools, like a parser with XPath, working with XML becomes rather nice. Also, it is nestable, such that the full input file can be included in the output inside some tag. Therefore you can just carry along all previous input and still parse everything as one large tree.
For most projects, XML is overkill. For Chroma, YAML would probably be slightly better, but there is no point in changing that after the fact.
Conclusion
For every new project that is some sort of simulation software, I would use YAML for as many things as possible. Parameters go into a YAML configuration file which is read by the simulation program. There won't be print statements in the code; instead there will be structured output, also in YAML format. This way it is very easy to parse the output with a program if one desires to do so. Any data that is written out is structured in YAML as well.
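As a hypothetical illustration of such structured output, the program could dump its results with PyYAML instead of using print (the keys and values here are made up):
import yaml

results = {
    'acceptance_rate': 0.82,             # made-up numbers for illustration
    'plaquette': [0.591, 0.588, 0.590],
}

with open('output.yml', 'w') as file:
    yaml.safe_dump(results, file)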
Only large data sets (more than can reasonably fit into a text file, perhaps a thousand values or so) would be stored in some binary format. An apparently good choice is HDF5: it is standardized and has tooling and libraries around it. It also supports enough complexity, such as hierarchy and metadata, that it can be used to store complicated data sets.
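A minimal sketch of writing such a data set with the h5py library (the file name, data set name and attribute are made up):
import h5py
import numpy as np

data = np.random.rand(1000, 8, 8)  # stand-in for some large array

with h5py.File('data.h5', 'w') as file:
    dataset = file.create_dataset('configurations', data=data)
    dataset.attrs['beta'] = 1.0    # metadata attached to the data set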