My GNU Autohell Story
In academic software development I have seen one software suite which was known to be hard to compile. I was tasked to try it, and ended up writing a 1200 line Bash script which took care of all the edge cases.
While I was working on my Master's thesis at the university in 2016, I worked with a software package for lattice quantum chromodynamics simulations. It is written in C++ and consists of multiple modules, totaling around 600,000 lines of code. One of my first tasks was just getting it compiled on the computer. That sounds like it should be easy, but it turned into a five week journey into GNU Autohell. This story is a couple of years old, but a recent conversation with a coworker brought it back to mind and I want to share it now.
Simulation software of that type needs to run on a supercomputer; there is no way that one could gain any traction on a single server, let alone a laptop. Academics get their computing allocations via grants, and larger groups usually have multiple grants in parallel on different machines. Each machine has a slightly different target audience, so it makes sense to have budgets on several of them. The work group has its own simulation package, but my task was to try out the “competitor's” package to find out whether it would be useful for us. Perhaps a collaboration would emerge if it turned out to be viable. Nobody in the group had significant experience with it, so I had to start from scratch.
Since the group had good experiences with the IBM BlueGene/Q machine in Jülich, I was tasked to start with that machine. So I downloaded the source code, and tried to compile it. There were of course compilation issues, and I needed to understand them. It turned out to be a seemingly endless cycle of compiler errors, research, and tweaking.
Early on it was clear that the build process would be so complicated that I wanted to have a script that does all the work for me. Due to the complexity I thought that Python would be the better choice. Over time I found that I was just calling shell programs the whole time, so I bit the bullet and wrote everything as a Bash script.
GNU Autotools and git
The software uses GNU Autotools as its build system. This means that one has the ./configure, then make, and then make install steps and it is all good. Except when it isn't, like in my case. In some of the repositories there was no configure script. My previous experience with Autotools was that this file must be present in source distributions. So I bugged the maintainers of the code to include that file in the git repository. Even with that it did not work, because the version of Autotools on the supercomputer was not the same one that the developers had used. I tried to get the exact same version of Autotools to make it compile. Only later I found out that all of this was wrong. The configure script should not be checked in at all; rather, it should be generated with autoconf from the configure.ac (historically configure.in) file.
So I called autoconf, but got failures with automake. Running either did not work. After lots of digging I found that the project structure was causing additional problems. The developers have made heavy use of git submodules, where one repository is included as a sub-directory of the parent one. Calling autoconf will automatically traverse down into the sub-directories and call autoconf there as well. But if the files there are in some weird state, it will fail. One needs to call autoconf and automake in a bottom-up fashion. And one needs to use autoreconf -fi to install missing files, which autoconf doesn't do on its own. I came up with this function, which calls autoreconf bottom-up in the git submodules and then in the parent directory. Oh, and one needs to call aclocal for some of the projects, otherwise linking will fail with obscure errors.
autoreconf-if-needed() {
    if ! [[ -f configure ]]; then
        if [[ -f .gitmodules ]]; then
            for module in $(git submodule foreach --quiet --recursive pwd | tac); do
                pushd "$module"
                autoreconf -vif
                popd
            done
        fi
        aclocal
        autoreconf -vif
    fi
}
But the developers were inconsistent about checking these generated files into their git repository. They seemed to be as clueless about the build system as I had been, which is bad. Sometimes they even had the generated Makefile checked in. Just removing these files was not feasible, though, because some of the modules did not use Autotools but just a hand-written Makefile. So I needed to carefully check for that first.
remove-configure() {
    pushd "$1"
    if [[ -f Makefile.am ]]; then
        rm -f Makefile
    fi
    rm -f configure
    popd
}
Installation to home directory
The trouble continued with the users not having root access on the supercomputers; one only has a regular user account. That makes total sense, but it means that sudo make install wouldn't work. It also meant that the libraries wouldn't get installed into /usr/lib, the headers wouldn't be installed into /usr/include, and the compile flag helpers weren't in /usr/bin. I had to manage the paths on my own and install everything into ~/local. Autotools supports that as well, I just needed to supply --prefix=$prefix to every ./configure call. And later on I needed to specify --with-foo="$prefix" such that the dependencies could be resolved.
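As a minimal sketch of that pattern (libfoo, consumer and the --with-foo flag are just placeholder names, not modules from the actual suite), each dependency gets installed into the home directory and every later module is told where to look:

prefix="$HOME/local"

# Build and install a dependency into the home directory instead of /usr.
pushd libfoo
./configure --prefix="$prefix"
make
make install
popd

# Point the next module at the home-grown installation.
pushd consumer
./configure --prefix="$prefix" --with-foo="$prefix"
make
make install
popd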
External dependencies
The software suite also depended on other software packages, which sometimes were already installed on the systems. They were not installed system-wide, but rather provided through a module system, such that one would for instance do module load Intel and get the Intel C++ compiler. The problem with some module implementations is that they don't set an exit status, so you cannot check programmatically whether there was a failure. I needed to wrap the call and parse standard error.
checked-module() {
    set +x
    if ! module "$@" 2> module-load-output.txt; then
        cat module-load-output.txt
        exit 1
    fi
    set -x
    cat module-load-output.txt
    if grep -i error module-load-output.txt; then
        exit-with-error "There has been some error with 'module $*', aborting"
    fi
    rm -f module-load-output.txt
}
These commands would then modify environment variables like PATH and LD_LIBRARY_PATH such that the other software packages could be found. The module names and environment variables differed from system to system. On some systems there weren't even modules for certain packages. The GNU Multiple Precision library (GMP) was such a case: on one machine it was directly available, on a second machine I needed to load a module, and on two others I needed to compile it from scratch.
case "$_arg_host" in hazelhen) gmp="-lgmp" ;; jureca|jureca-booster) set +x checked-module load GMP set -x gmp="$EBROOTGMP" ;; marconi-a2|qbig|marconi-a3) # The GNU MP library installed on the Marconi system, but it might be # linked against an older version of the standard library. Therefore we # compile it from scratch. repo=gmp print-fancy-heading $repo cflags="$base_cflags" cxxflags="$base_cxxflags" pushd "$repo" autoreconf-if-needed popd mkdir -p "$build/$repo" pushd "$build/$repo" if ! [[ -f Makefile ]]; then $sourcedir/$repo/configure $base_configure \ CFLAGS="$cflags" CXXFLAGS="$cxxflags" fi make-make-install popd gmp="$prefix" ;; *) exit-with-error "There is no default setting for the gmp for machine $_arg_host, please add this to the script." ;; esac
Sometimes these modules depended on each other, and I needed to figure out the correct order in which to load them. On one Cray system one even had to unload the Cray environment before one could load the Intel environment. However, unloading did not work, so one needed to swap them with a different operation, module swap.
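On the Cray machine that swap looked roughly like this (the same call appears again in the full script further down):

# Replace the default Cray programming environment with the Intel one in a single
# step, because unloading PrgEnv-cray on its own did not work.
checked-module swap PrgEnv-cray PrgEnv-intel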
Wrong branch
I was advancing further and further in compiling all the little modules, but I kept getting super strange errors. They looked like real errors in the C++ code, something that could never compile, no matter which compile options you passed. I kept digging, but couldn't make any sense of it. So I contacted the developers, and they were a bit surprised that I was trying to compile the master branch. They hadn't updated it for years; I should take the devel branch instead. Great, why not make devel the default branch on GitHub then?
This meant that I had to ensure that a certain branch was checked out. Unfortunately, some of the generated files were checked into git, and checking out a different branch is not possible with local changes. So resetting the state is one step, but not sufficient: if generated files are not checked in on the current branch but are on the target branch, the checkout would overwrite untracked files. Git doesn't know that I don't care about these files, so I also had to clean the directory. With these steps, one can reliably change the branch.
ensure-git-branch() {
    branch_actual="$(git rev-parse --abbrev-ref HEAD)"
    branch_target="$1"

    if [[ "$branch_actual" != "$branch_target" ]]; then
        git reset HEAD --hard
        git clean -df
        git checkout "$branch_target"
        remove-configure .
    fi
}
Static and dynamic linking
With the development versions I got to advance further, which was great. But eventually I started to get linker errors from a mixture of static and dynamic libraries. The hack was to just remove the built .so files after the build. Also, calling make in a deeply nested project took something like 15 seconds just to figure out that nothing had to be done, and it would rebuild the dynamic libraries every time. So I wrapped make and added a sentinel file, which made incremental runs of the script much faster.
make-make-install() {
    if ! [[ -f build-succeeded ]]; then
        if ! nice make -j $_arg_make_j; then
            echo "There was issue with the compilation, doing again with single process to give readable error messages."
            print-fancy-heading "Compile again"
            make VERBOSE=1
        fi
        if ! make install; then
            echo "There was issue with the installation, doing again with single process to give readable error messages."
            print-fancy-heading "Install again"
            make install VERBOSE=1
        fi
        touch build-succeeded
        if [[ -d "$prefix/lib" ]]; then
            pushd "$prefix/lib"
            rm -f *.so *.so.*
            popd
        fi
    fi
}
Accelerator library
Most supercomputers have a specific compiler associated with the vendor. For the IBM BlueGene/Q there is the IBM XL C/C++ compiler, which would be ideal. It knows the most about the PowerPC A2 chip used there, about its in-order nature and lots of other things. But it just couldn't do C++11, so it was out. Instead I used either GCC or Clang, I cannot remember exactly which I tried. It was clear that performance might be rather bad with a non-vendor compiler.
After a lot of fighting, I finally had it compiled with one of the free compilers. Performance turned out to be super slow. And since we had a fixed compute time budget, one could either insist on running that inefficient code or do a bunch of better things with the budget. So the only way forward was to make it more efficient. Luckily I had seen an option to link against an accelerator library.
Downloading that library was easy; it was written by somebody who had also helped to design that supercomputer. I got it compiled as well, but it just wouldn't link against the simulation software: the library calls did not match the API of the accelerator library. I contacted the developers of the accelerator library and of the simulation package, but didn't get much feedback. The original developers don't have access to an IBM system any more, so that feature is unmaintained. Further digging brought some more clarity: the simulation package only supports version 1.0 of the accelerator library, which only works on the BlueGene/L. The computer I had, the “Q”, needed version 3.0. So this was a dead end, and it was clear that this simulation package would not run on that computer without a big effort.
Next machine
That was very bad news: all the work was for nothing, and I had to start compiling on a completely different machine. The next one was a CPU machine with regular Intel Xeon processors and a fast InfiniBand network, nothing special about it. We decided to use the Intel C++ compiler to get the most performance. Compilation of the software needed a bunch of different flags now that the compiler had changed.
I'll show you the code that my script ended up with at the end of the whole project, when it supported seven different machines. It is a lot of code, and I feel ridiculous for putting it into this blog article. But I want to show how ridiculously complex these environments are, because GNU Autotools doesn't abstract any of this away. The user has to cope with all the crazy idiosyncrasies.
case "$_arg_host" in hazelhen) isa=avx2 compiler="${_arg_compiler:-icc}" ;; jureca) isa=avx2 compiler="${_arg_compiler:-icc}" ;; jureca-booster) isa=avx512 compiler="${_arg_compiler:-icc}" module load Architecture/KNL ;; local) if [[ -z "$_arg_isa" ]]; then exit-with-error "Builds on local machines require the -i option to be passed (ISA: avx, avx2, avx512)" fi isa=$_arg_isa compiler="${_arg_compiler:-gcc}" ;; marconi-a2) isa=avx512 compiler="${_arg_compiler:-icc}" ;; marconi-a3) isa=avx2 compiler="${_arg_compiler:-icc}" ;; qbig) isa=avx compiler="${_arg_compiler:-gcc}" ;; *) exit_with_error "The machine $_arg_machine is not supported by this script. Please use one of the supported machines (see -h for help) or ideally extend this script at https://github.com/HISKP-LQCD/chroma-auxiliary-scripts/tree/master/compilation :-)" ;; esac # If the user has specified an ISA, we override the setting. if [[ -n "$_arg_isa" ]]; then isa="$_arg_isa" fi # Set up the chosen compiler. case "$compiler" in # The cray compiler does not support half-precision data types (yet). So # one cannot actually use that for QPhiX right now. cray) cc_name=cc cxx_name=CC color_flags="" openmp_flags="" base_flags="-O2 -hcpu=haswell" c99_flags="-hstd=c99" cxx11_flags="-hstd=c++11" disable_warnings_flags="" #qphix_flags="" #qphix_configure="" ;; icc) color_flags="" openmp_flags="-fopenmp" c99_flags="-std=c99" cxx11_flags="-std=c++11" disable_warnings_flags="-Wno-all -Wno-pedantic -diag-disable 1224" #qphix_flags="-restrict" # QPhiX can make use of the Intel “C Extended Array Notation”, this # gets enabled here. #qphix_configure="--enable-cean" case "$_arg_host" in jureca) checked-module load Intel/2018.1.163-GCC-5.4.0 checked-module load IntelMPI/2018.1.163 silent module list cc_name=mpiicc cxx_name=mpiicpc host_cxx=icpc ;; jureca-booster) checked-module use /usr/local/software/jurecabooster/OtherStages #checked-module load Stages/2017a checked-module load Intel #checked-module load Intel/2017.2.174-GCC-5.4.0 checked-module load IntelMPI #checked-module load IntelMPI/2017.2.174 checked-module load CMake silent module list cc_name=mpiicc cxx_name=mpiicpc host_cxx=icpc ;; hazelhen) # On Hazel Hen, the default compiler is the Cray compiler. One needs to # unload that and load the Intel programming environment. That should # also load the Intel MPI implementation. checked-module swap PrgEnv-cray PrgEnv-intel checked-module load intel/17.0.6.256 silent module list # On this system, the compiler is always the same because the module # system loads the right one of these wrappers. cc_name=cc cxx_name=CC host_cxx=icpc ;; marconi-a2) checked-module load intel/pe-xe-2017--binary checked-module load intelmpi silent module list cc_name=mpiicc cxx_name=mpiicpc host_cxx=icpc ;; marconi-a3) checked-module load intel/pe-xe-2018--binary checked-module load intelmpi silent module list cc_name=mpiicc cxx_name=mpiicpc host_cxx=icpc ;; *) exit-with-error "Compiler ICC is not supported on $_arg_host." 
esac case "$isa" in avx) base_flags="-xAVX -O3" ;; avx2) base_flags="-xCORE-AVX2 -O3" ;; avx512) base_flags="-xMIC-AVX512 -O3" ;; esac ;; gcc) color_flags="-fdiagnostics-color=auto" openmp_flags="-fopenmp" c99_flags="--std=c99" cxx11_flags="--std=c++11" disable_warnings_flags="-Wno-all -Wno-pedantic" qphix_flags="-Drestrict=__restrict__" qphix_configure="" case "$isa" in avx) base_flags="-O3 -finline-limit=50000 -fmax-errors=1 $color_flags -march=sandybridge" ;; avx2) base_flags="-O3 -finline-limit=50000 $color_flags -march=haswell" ;; avx512) base_flags="-O3 -finline-limit=50000 -fmax-errors=1 $color_flags " #-march=knl ;; esac case "$_arg_host" in hazelhen) silent module swap PrgEnv-cray PrgEnv-gnu silent module list cc_name=cc cxx_name=CC host_cxx=g++ ;; jureca) checked-module load GCC checked-module load ParaStationMPI silent module list cc_name=mpicc cxx_name=mpic++ host_cxx=g++ ;; local) cc_name=mpicc cxx_name=mpic++ host_cxx=g++ base_flags="-O3 -finline-limit=50000 -fmax-errors=1 $color_flags -march=native" ;; marconi-a2|marconi-a3) checked-module load gnu checked-module load ParaStationMPI silent module list cc_name=mpicc cxx_name=mpic++ host_cxx=g++ ;; qbig) cc_name=mpicc cxx_name=mpic++ host_cxx=g++ ;; *) exit-with-error "Compiler GCC is not supported on $_arg_host." esac ;; *) exit-with-error 'This compiler is not supported by this script. Choose another one or add another block to the `case` in this script.' ;; esac
Cross compilation
Over the course of the Master's thesis a new machine also opened in Bologna. It featured the new Intel Xeon Phi Knights Landing chips. They have a massive number of cores and are basically a GPU on a CPU socket; programming them is much like programming a GPU. And although they can boot an operating system, the compute centers chose to equip the frontend nodes with regular Intel Xeon chips. This means that one needs to cross-compile everything for the target architecture.
GNU Autotools supports cross compilation, but everything gets one step more complicated. When you compile for a different target, you cannot execute the generated programs on the machine that built them. Autoconf usually compiles small test programs to probe the environment, but without running them it doesn't know the result. This means that one has to supply more information about the host platform up front, which just adds to the list of compilation flags.
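A hedged sketch of what such a configure call can look like (the triplet, flags and the pre-seeded cache variable are illustrative and not copied from my script): one passes --host to force Autoconf into cross-compilation mode and provides answers for checks it would otherwise obtain by running a test binary.

# Illustrative cross-compiling configure call for the Knights Landing partition.
# AVX-512 binaries build fine on the Xeon frontend but cannot run there, so
# configure must not try to execute its own test programs.
./configure --prefix="$prefix" \
    --host=x86_64-pc-linux-gnu \
    CC=mpiicc CXX=mpiicpc \
    CFLAGS="-xMIC-AVX512 -O3" CXXFLAGS="-xMIC-AVX512 -O3" \
    ac_cv_sizeof_int=4
# The last assignment pre-seeds a result that a run-time test would normally provide.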
Generated code
One of the modules had generated code in it. The developers had separated that into a different git repository such that the history would not be polluted with it. It was only sporadically updated and did not always reflect the latest state of the code.
It is a bit puzzling to generate code as text and compile that with a C++ compiler when one already has the preprocessor and templates to generate code at compile time. There are a few good arguments for it, namely that one can inspect the generated code and see whether it does the correct thing. Otherwise one would have to fully trust the compilers to generate efficient code, and some compilers aren't exactly trustworthy, while others are not capable of the template magic required. The problem was that for Knights Landing we needed to generate around two million lines of code, which then got compiled into a static library of around 1 GB in size. It just felt wrong to do it this way.
Eventually I moved the code generation stage into the compilation stage of the whole software suite. This removed the inconsistencies, but it drew in Python as a dependency, because I didn't want to do code templates with sed and rather wanted to use the Jinja template library for it.
Python versions
As I needed Python 3, I had to find it on the systems. Modern Linux distributions ship with Python 3, but some supercomputers run rather old versions of CentOS and provide newer software only via modules. One machine had a Python 3.4.3 module, but I needed a more recent version. There were four installations of Python on that machine:
- /usr/bin/python3 (3.4.6)
- /opt/python/17.11.1/bin/python3 (3.6.1)
- /opt/python/3.6.1.1/bin/python3 (3.6.1)
- /sw/hazelhen-cle6/hlrs/tools/python/3.4.3/bin/python3 (3.4.3)
In the end I used the 3.6.1 installation, which wasn't really exposed as a module. It didn't feel too good. On the other machines there were modules, but no two of them had the same name. It was just an inconsistent mess that I needed to cope with in my script.
case "$_arg_host" in hazelhen) export PATH="/opt/python/3.6.1.1/bin:$PATH" ;; jureca|jureca-booster) checked-module load CMake checked-module load Python ;; marconi-a2) checked-module load cmake checked-module load python ;; marconi-a3) checked-module load cmake checked-module load python/3.6.4 ;; esac
To make it worse, CMake could not find the Python interpreter properly, so I needed to let Python tell me its paths, store them in variables and then pass those on to CMake.
cmake_python_include_dir="$(python3 -c "from distutils.sysconfig import get_python_inc; print(get_python_inc())")"
cmake_python_library="$(python3 -c "import distutils.sysconfig as sysconfig; print(sysconfig.get_config_var('LIBDIR'))")"
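These variables then went onto the CMake command line, roughly like this (the -D variable names follow the classic FindPythonLibs convention and are reconstructed from memory rather than copied from the script; newer CMake versions use different hint variables):

# Hand the interpreter hints to CMake so it does not pick up the ancient system Python.
cmake "$sourcedir/$repo" \
    -DPYTHON_INCLUDE_DIR="$cmake_python_include_dir" \
    -DPYTHON_LIBRARY="$cmake_python_library"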
Compilation constants
Different CPUs have different SIMD register widths and cache sizes. It therefore makes sense to write the fundamental data structures as C++ templates which can be adapted to different architectures. At some point one needs to specialize them and create concrete instances. One could do that for a list of values and choose among them at run time. That would be the easiest for the user, as the compilation would be the same everywhere and one could even switch modes as one liked.
The developers chose to expose these things as compilation constants instead. This has the advantage that the code does not need any abstraction over the different data types (float or double) and the different SIMD lengths; it can always use the one specified by the compile constant. But the compiling user has to track all these flags and pass them to the various modules in the suite. The users also have to build the whole software stack twice in different locations when they want to try two variants on the same machine.
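To make the shape of that problem concrete, here is a hypothetical example; the flag names are made up for illustration and are not the real options of the suite:

# Hypothetical configure flags: every module in the stack has to agree on them,
# and changing any of them means rebuilding everything into a separate prefix.
prefix="$HOME/local-double-soa8"
./configure --prefix="$prefix" \
    --enable-precision=double \
    --enable-simd-length=8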
No internet access
One of the machines that I worked on was hosted in Stuttgart. It is not purely for academic use but also for industry. Therefore they have gone to the extreme and cut off all outgoing internet access. This means that you cannot do a git pull to get the code; you need to download it to your laptop and then use rsync to get it over to the machine. They suggest that you open your own HTTP proxy via SSH port forwarding, but then everyone else on the supercomputer frontend could use your tunnel to access the internet and do anything with it. Not the best idea if you don't want crimes committed over your connection. Besides that, it is super slow when working from home, as everything has to be uploaded through your DSL or cable connection.
I ended up writing the script such that it downloads all the software first and allows a break there. Below that break, no further internet access is needed. The user runs the first part of the script on the laptop, copies the code over, and then runs the rest of the script on the machine.
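The structure was roughly the following; the option name and the helper function are hypothetical stand-ins for what the actual script does:

# Phase 1 can run on a laptop with internet access.
download-all-sources

if [[ "$_arg_only_download" = "on" ]]; then
    echo "Sources downloaded. Rsync this directory to the cluster and re-run the script there."
    exit 0
fi

# Phase 2 (configure, compile, install) needs no internet access at all.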
Archive formats
Another hilarious complication was the GNU MP library, because it is provided as an lzip-compressed tar file on their website. On the CentOS that was used on at least one machine, there was no lzip command available. I didn't want to compile lzip from source only to unpack the sources of GNU MP, so I downloaded some version of GNU MP, repackaged it as a plain gzip-compressed archive and uploaded it to my web space on this domain.
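The repackaging itself is a one-time step on a machine that does have lzip; the version number here is just an example:

# Unpack the lzip archive and repack it as gzip, which every system can handle.
lzip -d gmp-6.1.2.tar.lz
gzip -9 gmp-6.1.2.tar
# The resulting gmp-6.1.2.tar.gz then gets uploaded to my own web space,
# and the script downloads it from there.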
Community uptake
Initially I had developed the compilation script only for myself. I wrote it in the hope that the upstream developers would integrate it, but they never did. They have their own compilation script, but that only works on their machine; it was not abstracted enough to support other machines. They had no incentive to use anything other than their script, and they didn't need to make their build system easier because their script worked for them. All I could do was maintain my build script somewhere else.
Other people in the community asked me for the script, as they needed to compile the software suite and felt daunted by the sheer number of modules and their compilation flags. I was happy to provide it, and they just needed to run it for around 60 minutes to get a working installation of the simulation software suite. Eventually people asked for other machines, so I needed to figure out the idiosyncrasies of those machines as well.
If this work had been put into the build system itself, it would have benefited a lot more users. I did try to port one of the libraries to CMake, which made it much easier to compile than before. But the other ones would need to be ported as well, and I didn't sense much excitement for that from upstream.
Conclusion
This story gives a good idea of software development in academia. It isn't really anybody's fault, because the system is stacked against the people in it. The software is developed by PhD students and postdocs, who don't work on the same project for more than a couple of years. They might go into industry (like myself), change fields, change universities and collaborations. The software would need a product owner with a permanent position. Only professors have such positions, but they have little time to head a software development effort. They are busy with a mix of teaching, writing grants, supervising students, writing papers and getting them reviewed and published, reviewing papers themselves, taking care of administrative things, representing the university, forming collaborations, speaking at conferences and whatnot.
Also, nobody really cares about the software itself. The only hard success metric for researchers is the number of papers they crank out. If they manage to do that on badly maintained software, that is great! They shouldn't have wasted time on something that the world doesn't see; at least that is how it appears to be. But without great software, researchers cannot deliver anything these days. Simulations depend on complicated software with tight performance requirements. If these are not maintained properly, research will stall.
Nothing short of a completely different way of doing software in academia is needed. Universities need to employ software developers who help researchers write maintainable software, point them to standard solutions and prevent them from building software that needs an external 1200 line Bash script just to configure and build.