C++ Anti-patterns
This is a list of C++ anti-patterns that I have seen in various codes.
Preprocessor
Lowercase preprocessor constants
Preprocessor constants should always be in capital letters. Otherwise you will get the weirdest of bugs that take hours or days to track down. Say there is some feature that you want to switch on or off at compile time. One way to do it would be using #ifdef like this:
int make_query(Query const &q) {
    // Do some query stuff here.
#ifdef VERIFY_RESULT
    // Perform some verification.
#endif
    // Some other tasks and return statement.
}
When compiling, you can give the option -DVERIFY_RESULT
and the preprocessor
will put in the verification code. In one project I have seen this same thing,
but it was with #ifdef verify_result
. That is legal C++ and also works with
the -Dverify_result
command line option to the compiler.
One member of the project who had recently joined and did not know about the many compilation flags created a new function bool verify_result(Result const &result). The compiler did not complain that the name was already in use. Rather, the compiler saw bool (Result const &result) when the option was given, because that flag did not get any value. GCC complained that it did not expect the ( there.
I do not know how long they tried to track that down, but it was long enough to be a real pain and certainly a big waste of time. This would not have happened if the preprocessor constants had been uppercase all the time.
Preprocessor over scopes
Another project makes a lot of use of preprocessor flags to change the way the code works. This in itself is not a problem. However, there are pieces of code like this:
#ifdef FLAG
}
#endif
With multiple such flags, I do not see how you can sensibly reason about this code.
Including source files
This was in one of the header files:
#include "other.cpp"
Including source files in header files is usually just wrong. Edge cases are unity builds or template specializations. Said project just did it for no apparent reason.
Unnamed #if 0 branches
The pair #if 0
and #endif
can be used to quickly comment out large chunks
of code. Contrary to /*
and */
, they have the advantage that they can be
nested and also serve as a sort of super-comment.
When there are multiple of those blocks around, it becomes a bit hard to understand why some are enabled and others are not enabled:
#if 0
// ...
#endif

// ...

#if 1
// ...
#endif

// ...

#if 0
// ...
#endif
Is the middle one the negation of the first one? Are the first and the third one actually the same; do they have to be enabled at the same time?
Here I like it better to introduce some named constant for that. The above could look like the following:
#ifdef POLISH_WIDGETS
// ...
#endif

// ...

#ifndef POLISH_WIDGETS
// ...
#endif

// ...

#ifdef POLISH_WIDGETS
// ...
#endif
Then it would be easy to understand what is going on. If somebody wanted to
change the code, a single #define POLISH_WIDGETS
would be enough and
everything would be consistent and readable.
Header files
Standalone header files
A header file should include all the other headers it needs to be compiled. Something I have seen a couple of times is this:
f.hpp:

#pragma once

void f(MyClass x);
Notice that the type MyClass is not defined here, and the header that defines it is not included in f.hpp at all. It is defined in myclass.hpp:
#pragma once

class MyClass {
  public:
    int member;
};
The implementation of the function f
is in f.cpp
. There the header for
MyClass
is included:
#include "myclass.hpp" #include "f.hpp" void f(MyClass x) { x.member = 0; }
And that actually compiles because the compiler only works on the .cpp
files.
And given the ordering of the #include
statements, the C++ compiler sees
this:
class MyClass {
  public:
    int member;
};

void f(MyClass x);

void f(MyClass x) { x.member = 0; }
This will work as long as that header f.hpp
is always included after
myclass.hpp
. It will start to fail when you use f.hpp
somewhere else. There
should be an #include "myclass.hpp"
in f.hpp
.
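The fix is small; a sketch of the standalone version of f.hpp:

// f.hpp
#pragma once

#include "myclass.hpp"

void f(MyClass x);

Now the header compiles on its own, no matter where and in which order it is included.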
Wrong order of header files
The order of the header files in f.cpp
in the above example made it possible
to hide the missing include inside f.hpp
such that it still compiles. One can
make this fail earlier by having the following order of includes in the .cpp
files:
- "Own" header file, for
X.cpp
that would beX.hpp
- Project header files
- Third-party library headers
- Standard library Headers
This way, missing include statements in the header will become directly apparent. You might find a bug in a third-party library this way.
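For a hypothetical widget.cpp, that ordering could look like this; all the file and library names are made up:

// widget.cpp
#include "widget.hpp"          // own header first, so it must be standalone

#include "project/config.hpp"  // other project headers

#include <thirdparty/lib.hpp>  // third-party library headers

#include <string>              // standard library headers last
#include <vector>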
Cyclic dependencies
I saw a case where the file A.hpp provides the classes A1 and A2, and the file B.hpp provides B1. For some reason, the dependencies were A1 → B1 → A2. That is not directly a cyclic dependency of the types, but of the header files. This was "solved" in the project like this in file A.hpp:
class A1 { ... };

#include "B.hpp"

class A2 { ... };
Sure, that compiled just fine. But again, header B.hpp
is not a standalone
header and can only be used in this sandwich. I have resolved this by creating
the files A1.hpp
and A2.hpp
and properly including them inside each other
to break the cyclic dependency of the files.
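A sketch of that split; the class bodies are elided, only the include structure matters:

// A1.hpp
#pragma once
class A1 { /* ... */ };

// B.hpp
#pragma once
#include "A1.hpp"
class B1 { /* uses A1 ... */ };

// A2.hpp
#pragma once
#include "B.hpp"
class A2 { /* uses B1 ... */ };

// A.hpp, kept for existing includers, now just pulls in both parts.
#pragma once
#include "A1.hpp"
#include "A2.hpp"

Every header is standalone again and the include chain mirrors the actual type dependencies.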
Mixing up system and local headers
There are two distinct include paths: the one that gets searched when you use #include <…> and another one for #include "…". The former is for system and third-party library headers that are installed globally; you can add to that path using -I. The latter is for your own project and can be amended with -iquote (at least with GCC and Clang). One should not mix the two and use -I . in order to include the local headers with #include <…>.
using namespace inside headers
One can argue about the usage of using namespace
. In most cases I will not
use it to keep the global namespace somewhat clean. Using
using namespace std;
in a .cpp
file is okay because the namespace std
is
just dumped out in a single compilation unit. I do not have to worry about it
in other files.
It becomes a totally different story once you put that using namespace std; into a header file. Then every other header or source file that includes it will have that namespace dumped out. Even other projects using that library will suffer from the library programmer's attempt to save some typing.
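A made-up example of how this spreads:

// somelib/utils.hpp -- hypothetical library header
#pragma once

#include <string>

using namespace std;  // dumped into every file that includes this header

string greet(string const &name);

Every file that includes somelib/utils.hpp now has all of namespace std in its global namespace, and perfectly reasonable local names can suddenly clash with names from the standard library.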
Multiple files with same include guard
I do not like the #ifndef X
, #define X
, #endif
style include guards
because they are so error prone. They are hard to read, but even worse, one can choose the same include guard twice in a project. One could even pick an include guard that some other library has already used.
If that is the case, the errors will be painfully subtle. Depending on the order of inclusion, the header included later will effectively be empty. Then you get errors about undefined types and undefined functions and might pull out your hair before you realize what is going on.
#pragma once
is a good alternative except if you are working with IBM XL 12.
But that compiler cannot do C++11, so it is not that interesting for me anyway.
Include guards do not match the filename
If you have #ifndef FOO_H
in bar.h
, bad things will happen when you create
another foo.h
. Therefore the include guard should always be derived from the
path, or use #pragma once
.
C library headers
Using C functions in C++ might be a reasonable thing to do. If you do that, do
not include X.h
but rather cX
. So instead of <stdint.h>
use <cstdint>
.
This way the C functions will be properly in the std
namespace.
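A small example; the std-qualified names are the ones that are guaranteed to exist:

#include <cstdint>  // instead of <stdint.h>
#include <cstring>  // instead of <string.h>

std::uint64_t length_of(char const *text) {
    // Both the type and the function are guaranteed to live in namespace std;
    // whether the unqualified names also work is up to the implementation.
    return std::strlen(text);
}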
Functions like printf
, scanf
, atoi
, fopen
, malloc
, and a bunch of
others should not be used in C++. There are so many better ways.
include inside namespaces
For reasons unknown to me, there are a few includes of standard library headers inside of namespaces. The only reason I can think of is that writing ::std:: instead of std:: seemed like too much work.
GCC 6.0 has changed the standard headers such that they do not work when you include them inside a namespace. This way one can catch this anti-pattern.
Correctness
NULL, 0, and nullptr
In C++ (and C) there are 64-bit unsigned integer numbers (uint64_t or perhaps unsigned long) and pointers (void *). On a typical 64-bit machine they are exactly the same thing: 64-bit unsigned integers. For the type system, it still makes a great difference.
Although they are both just numbers to the computer, one should help the programmers read the source code and write
uint64_t number = 0;
void *pointer = NULL;
This way, it is clear that one is number-zero and the other is pointer-null.
The preprocessor constant NULL is just 0 or 0L, so the compiler always sees a plain zero there. Still, it is helpful for the programmer to read. In C++11 there even is nullptr, which has the correct type, std::nullptr_t.
In some code you see horrible things like these:
uint64_t number = NULL;
void *pointer = 0x0;
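With nullptr from C++11 the compiler would reject the first of these, so I would write it like this:

void *pointer = nullptr;       // fine, nullptr converts to any pointer type
// uint64_t number = nullptr;  // does not compile, unlike NULL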
Casting pointers to integers
For some reason, in one project pointers are cast to integers. And they are converted to a special integer type whose width you have to choose as a flag to configure such that it exactly matches the pointer width, otherwise the compiler would give you an error.
I still do not understand why one would write this sort of brittle code.
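If storing a pointer in an integer is really necessary, the standard library already provides an integer type of the right width (on all common platforms), so no configure flag is needed; a minimal sketch:

#include <cstdint>

std::uintptr_t to_bits(void *p) {
    // std::uintptr_t is defined to be wide enough to hold an object pointer.
    return reinterpret_cast<std::uintptr_t>(p);
}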
Subtle dependence on operator precedence
What is the value of x
?
uint32_t x = 1 << 3 - 1;
I would read it as $2^3 - 1$ and that is seven. The value actually is four. When compiling with Clang, it conveniently says this:
precedence.cpp:7:25: warning: operator '<<' has lower precedence than '-'; '-' will be evaluated first [-Wshift-op-parentheses]
    uint32_t x = 1 << 3 - 1;
                   ~~ ~~^~~
precedence.cpp:7:25: note: place parentheses around the '-' expression to silence this warning
    uint32_t x = 1 << 3 - 1;
                      ^
                        (    )
What did the original author of that line really mean? Was he aware that the - operator binds more tightly than <<? Can I trust that code without completely understanding where all those magic numbers come from?
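Parentheses make either intent explicit; a small sketch:

#include <cstdint>

std::uint32_t x1 = 1 << (3 - 1);  // 4, what the original line actually computes
std::uint32_t x2 = (1 << 3) - 1;  // 7, what a casual reader might expect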
Not using RAII
I have seen a couple persons program Java in C++ syntax. With that I mean they wrote stuff like this:
T *t = new T();
// Some code with t-> everywhere.
// No delete.
They did not dispose of the allocated memory again. If they were doing this to
hand out the memory or something like that, a heap allocation is needed. But
just for local variables, this is nonsense. I asked why he did not just
write T t;
and be done with it. He said that he is used to the
T *t = new T();
syntax. Then I pointed out that he had a memory leak. He
replied that the runtime will take care of that. Too bad there isn't a runtime
in C++ ...
Another person doing this style of programming was actually supposed to teach us students C++.
Since C++11 there are the standard std::unique_ptr and std::shared_ptr with std::make_shared and std::make_unique (the latter only since C++14) that eliminate virtually every case where you need a raw owning pointer.
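For the cases where a heap allocation really is needed, a sketch with the standard smart pointers, assuming C++14:

#include <memory>

struct T {
    void frobnicate() {}
};

void local_use() {
    T t;  // plain local object, no heap allocation needed
    t.frobnicate();
}

std::unique_ptr<T> hand_out() {
    // Heap allocation with automatic cleanup, no delete anywhere.
    return std::make_unique<T>();
}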
Not using const
in C++
const
is an awesome thing. Perhaps Rust with its mut
approaches the problem
even better although I am a fan of the minimal annotation concept that seems
to prevail in C++. Whenever I program in C++, I try to put const
wherever I
can. It helps me to reason about my code (fewer moving parts) and helps the
compiler to elide some of the variables.
If you do not even use const
for your global variable pi
, you are just
begging for trouble. I mean I would just add the following somewhere to
implement the Indiana Pi Bill:
void indiana() { pi = 4; }
And then the whole program would change its behavior and it would be a pain to debug.
Another funny example that I found in code:
double a = (double)(1.25); // always 1.25
It is good to give this magic number a name, but why isn't there a const
?
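With const, both examples become much harder to break by accident:

double const pi = 3.141592653589793;  // pi = 4; now fails to compile
double const a = 1.25;                // really always 1.25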
Global variables everywhere
The const
keyword is there for a reason. It is a pity that it is not enforced
in C. But that's why we got C++. The project I have in mind consists of almost
only global variables and routines. One cannot even call them functions, they
only have side effects.
This needlessly increases the number of things the programmer has to juggle in his head simultaneously. Unit testing is impossible with such a monolithic code. How do you reason about such code?
Unreachable code
Some code is just plain unreachable. Say you have a contraption like this:
if (condition x) {
    // something
} else {
    // something else
    if (condition x) {
        // What?
    }
}
There are cases when you have multiple threads and you use locks or atomic variables. Then this code might be reached. Otherwise it is just bloat.
Faulty buffer implementation
Using other people's tested code is beautiful. So when the standard library of
your programming language supports something like a buffer (std::vector
perhaps), use it!
One project contained its own buffer implementation which had a serious bug: calling clear() did not reset the internal cursor to the beginning, so the buffer would eventually run into other memory. That might corrupt the data or (hopefully) crash the program. Just use the standard library!
C-style casts
C++ has the static_cast
, the const_cast
, the dynamic_cast
, and the
reinterpret_cast
. They only cast a particular attribute, not everything.
Often I see a C-Style cast:
int i, j;
double x = ((double)i) / j;
It would be better to use a C++ cast instead:
int i, j;
double x = static_cast<double>(i) / j;
Then it is clear that it is about the type (int
vs. double
) and not about
removing a const
or something like that.
Mixing signed and unsigned integers
There are signed (int, long, signed char) and unsigned (unsigned int, unsigned long, unsigned char) integer types. For a long time I thought that their major difference is the ability to store negative numbers. But that is not their key difference. What distinguishes the types is that only unsigned integers have defined overflow behavior.
Most programs use these types for the wrong reasons. Using an unsigned integer looks like it forces the caller of the function to give you a non-negative number. But what you are actually specifying (to the compiler) is that you want well-defined wraparound behavior. Most of the time you do not want that, so using a signed integer is actually the correct choice as far as the compiler is concerned.
See this talk by Chandler Carruth where he explains the issue with a concrete code example.
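A classic way this bites, as a small made-up example:

#include <cstddef>
#include <vector>

int sum_all_but_last(std::vector<int> const &v) {
    int sum = 0;
    // If v is empty, v.size() - 1 wraps around to a huge unsigned value,
    // so the loop runs and reads far out of bounds.
    for (std::size_t i = 0; i < v.size() - 1; ++i) {
        sum += v[i];
    }
    return sum;
}

With a signed loop index and a signed conversion of the size (or simply a check for the empty case), the wraparound cannot happen.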
Always using 0 exit status
The exit status (return value of main
) signals whether an error has occurred
to the caller. This is important when the program is called from a test script.
Without a proper exit code there is no way of reliably determining the outcome
of the called program.
Often I see the following construct:
std::cout << "ERROR: The widgets could not be frobnicated!!" << std::endl; exit(0);
There might be the word "error" on the screen for the user to see, but a
calling script cannot infer that something went wrong. Therefore use exit(1)
or some other non-zero number to give the appropriate signal to the outside.
I like throwing an exception even better. So I suggest doing the following instead:
throw std::runtime_error("The widgets could not be frobnicated!");
This has the following advantages:
- The exit status is automatically set to a non-zero value.
- The error message on the console will be emitted on standard error (std::cerr).
- In contrast to abort(), the exception can be caught further up, and then the destructors of the current scopes will be called.
- The keyword throw stands out in the editor to signal that this code path is about something special. No need for the word "error" in all-caps, multiple exclamation marks or similar.
- The error message has a standard format:

$ ./a.out
terminate called after throwing an instance of 'std::runtime_error'
  what():  The widgets could not be frobnicated!
fish: './a.out' terminated by signal SIGABRT (Abbruch)

- The user can load the program in the debugger and work with the stack trace to see the origin of the error. For a simple example (exception thrown in f) this looks like this:

(gdb) bt
#0  0x00007ffff716d660 in raise () from /lib64/libc.so.6
#1  0x00007ffff716ec41 in abort () from /lib64/libc.so.6
#2  0x00007ffff7ae3025 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3  0x00007ffff7ae0c16 in __cxxabiv1::__terminate(void (*)()) () from /lib64/libstdc++.so.6
#4  0x00007ffff7ae0c61 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007ffff7ae0ea4 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x000000000040093a in f () at cpp.cpp:7
#7  0x0000000000400964 in main (argc=1, argv=0x7fffffffe588) at cpp.cpp:11
Also I can move around the stack and take a look at all the variables in the program.
Numerics
Insufficient serialization precision
Binary data formats are harder to understand than a simple text file that is human readable. However, storing arrays of numbers in binary has many advantages:
- The file size is just the size of the array in memory. This also means that the IO bandwidth is used more efficiently.
- There is no need to parse the file, it can be loaded directly into the memory.
- As every number has the exact same size, one can parallelize reading and writing of chunks of the file, say with MPI.
- There is no precision loss.
The main disadvantage is that you might have a hard time when changing the architecture. With IEEE floating point this should be fine on all usual computers. There are architectures like PowerPC which work in big endian (though PowerPC supports both), whereas the usual x86-64 CPUs work in little endian.
If you have a complicated binary format with various different elements, you might have a hard time reading the data without the actual program. I would suggest not to roll your own and instead use something structured like HDF5 to store data and metadata in a binary format.
In projects I have seen people writing out floating point numbers with a simple
fprintf(fp, "%f", number)
. This looks great at first as you can read the
generated text file with a text editor. But IO will be slower than necessary.
Also you only get a few digits (six after the decimal point by default). This means that writing and
reading your data will introduce rounding errors.
Even though I would strongly recommend binary IO for this, use at least
std::numeric_limits<T>::max_digits10
(cppreference
page)
digits when writing numbers to a text file. This way it will be exact in the
conversion from binary to decimal and back.
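If it has to be a text file, a sketch of writing with full round-trip precision using iostreams:

#include <fstream>
#include <iomanip>
#include <limits>

void write_exact(std::ofstream &out, double const number) {
    // max_digits10 decimal digits are enough to recover the exact binary value.
    out << std::setprecision(std::numeric_limits<double>::max_digits10)
        << number << '\n';
}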
Object oriented programming
Too large classes
A class should have a number of invariants. These are statements about the consistency of the thing that the class represents. All member functions need to take the class from one valid state to another one. Within a member function you may temporarily break an invariant if you restore it before the end of the function.
The set of member functions should be as small as possible. Everything else should be implemented as free functions. I often see code where a bunch of convenience functions are crammed into the class.
For an implementation of the Ising model with the Metropolis algorithm one needs some sort of 2D data structure. A simple quadratic lattice can be implemented using this class:
class Lattice {
  public:
    using Spin = int8_t;

    Lattice(int const size) : size_(size), data_(size * size) {}

    Spin &operator()(int const x, int const y) { return data_[y * size_ + x]; }

    Spin const &operator()(int const x, int const y) const {
        return data_[y * size_ + x];
    }

    int size() const { return size_; }

  private:
    int size_;
    std::vector<Spin> data_;
};
In a lot of solutions I have seen member functions like
double Lattice::energy() const
, double Lattice::magnetization() const
and
void Lattice::do_update()
. They do not belong in this class! The invariants
of this class are:
- There is enough memory for the whole lattice.
- Each lattice point contains a valid spin value.
The function energy
should rather be implemented as a free
double energy(Lattice const &lattice)
which can then use the
Lattice::operator() const
to compute the total energy of the lattice.
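A sketch of such a free function; the nearest-neighbor coupling and the periodic boundary conditions are my assumptions, not something the class prescribes:

double energy(Lattice const &lattice) {
    int const n = lattice.size();
    double sum = 0.0;
    for (int y = 0; y < n; ++y) {
        for (int x = 0; x < n; ++x) {
            // Only the right and down neighbors, so every bond is counted once.
            sum += lattice(x, y) *
                   (lattice((x + 1) % n, y) + lattice(x, (y + 1) % n));
        }
    }
    return -sum;
}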
Missing virtual destructor
When a class is meant to be derived from, the destructor should be virtual.
Otherwise there can be memory leaks. Take this very simple example where we
have a base class that has no members and no virtual functions. A derived class
(over which the author of Base
has no control) now introduces manual resource
management with new
and delete
. In the user code we use dynamic
polymorphism. When the unique_ptr
gets destructed it will call delete
on
its internal pointer. And that will then call Base::~Base
which is a
non-virtual empty function supplied by the compiler.
#include <iostream> #include <memory> class Base { public: void hello() { std::cout << "Hello, World" << std::endl; } // virtual ~Base() {} }; class Derived : public Base { public: Derived() : p(new int[1]) {} ~Derived() { delete p; } private: int *p; }; int main() { std::unique_ptr<Base> base(new Derived()); }
The lack of the virtual destructor means that Base does not have a virtual function table and that deriving classes have no chance of overriding the destructor. In this example this introduces a memory leak which is directly found with the leak sanitizer (-fsanitize=leak):
==31756==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 4 byte(s) in 1 object(s) allocated from:
    #0 0x7f520ce6d605 in operator new[](unsigned long) (/lib64/liblsan.so.0+0xe605)
    #1 0x400975 in Derived::Derived() (/home/mu/Projekte/eigene-webseite/articles/cpp-antipatterns/a.out+0x400975)
    #2 0x4008c6 in main (/home/mu/Projekte/eigene-webseite/articles/cpp-antipatterns/a.out+0x4008c6)
    #3 0x7f520c1e0f29 in __libc_start_main (/lib64/libc.so.6+0x20f29)

SUMMARY: LeakSanitizer: 4 byte(s) leaked in 1 allocation(s).
When the virtual destructor is commented in, the memory leak disappears.
Formatting & organization
Leading or double underscores
One should not use leading underscores for definitions in normal code. Those names are reserved for the standard library and the compiler. When programming with intrinsics like AVX, one has to use functions and types that start with underscores, but one is not allowed to define new names with leading underscores.
Names containing two consecutive underscores anywhere are also reserved.
A pattern I often see is the usage of leading underscores in include guards:
#ifndef __MY_AWESOME_HEADER_H__
#define __MY_AWESOME_HEADER_H__

// Code

#endif
There is no need for the __
, really. Just having MY_AWESOME_HEADER_H
would
be fine as well. Or perhaps just #pragma once
, but that is a different
debate.
Inconsistent indentation
For several years now, there has been clang-format. If you do not use it, you are seriously missing out. It will format your whole codebase perfectly and also comes with editor plugins. You do not have to worry about indentation; it will just be perfect.
Unfortunately it is very hard to apply that to a codebase which has a lot of branches in the git repository. Reformatting all the code will lead to merge conflicts and resolving them will not be any fun. So start your project right away with it or attempt to introduce it at some point.
Inconsistent formatting will make the code harder to read and reason about, something you do not want with code that is already complex.
No comments
Actually, I think that having no comments is great up to a point. Compare
double f(double x) {
    double c = 235;
    double z = c / x;
    return z;
}
with
// Convert {miles per gallon} to {liters per 100 km}.
double f(double x) {
    // Magic conversion factor.
    double c = 235;
    // Do the conversion.
    double z = c / x;
    // Return the result.
    return z;
}
and with
double mpg_to_l100km(double const mpg) {
    double const conversion_factor = 235;
    double const l100km = conversion_factor / mpg;
    return l100km;
}
I think that good variable and function names can make a huge difference and eliminate the need for comments. If you can write your code so expressively that you do not need the comments, that is better than commenting bad code.
No formal docstrings
Python has docstrings, Java has Javadoc, and C++ has documentation comments that you can process with tools like Doxygen or Sphinx. Not having any of those makes some tricky functions harder to use than necessary, in case the function is not as trivial as its name might suggest.
Outdated comments
Even worse than no comments are comments that do not match the code. Then you are left to figure out what happens. Hint: The compiler only reads the code.
Complex variable names
What does the variable nyHP1
contain? Isn't it obvious that it is "number of
elements in y-direction halved plus one"?
Bugtracking and changelog in the source itself
In the (to me) ancient days when one did not have version control or a bug tracking system, keeping those in text files next to the source seemed to make sense. Today there is no excuse for tracking your bugs at the top of the one source file.
3400 lines in a single file
When a file is too long, you cannot keep its structure in your mind. After a couple hundred lines, you should start a new file. That is also a good opportunity to refactor the code into smaller pieces.
Of course this might increase compilation time. Then something like a unity build might be appropriate again. However, this is a distinct step and makes the file more readable.
Single function with 800 lines
Having a single function with 800 lines is just impossible to reason about. You cannot even remember the variables you have declared. It is almost as bad as if they were global.
Superfluous blank lines
Occasionally I see lots and lots of blank lines in code. One or perhaps even two blank lines can help to group various parts of a file. When there are five or more blank lines in a row, it dilutes the information density on screen; I have to scroll to read the code and cannot fit a whole function on my screen. Getting rid of those superfluous blank lines makes the code easier to read.
Duplicating source code
Say there is a function that does a particular task. Then there is a need for a second task which is almost like the first task, but different at a couple of different points. It might be tempting to just duplicate all the code and change the few bits.
In the long run, the parts which used to be the same will diverge, though. And then the real cost of duplicating the code will show up: Do you have to copy the changes? It would be better to abstract the first task such that there is a generic task and the first task is just a specialization of that. Then the second task could be added as another specialization without duplicating any code.
In the real world, this is not that simple. It might be worthwhile to start by duplicating the code and getting the feature implemented. Then later on, one can find a good abstraction for the two tasks. Finding the abstraction beforehand might not be feasible.
Hard to read for loops
Sometimes a loop has to be done over a few elements only. For instance 1
and
-1
. Then writing a loop like this would work:
for (int i = 1; i >= -1; i -= 2)
However, it takes a while to get that it iterates over 1
and -1
. A more
succinct way of writing would be this:
for (int i : {1, -1})
That needs C++11, but that should be available.
Miscellaneous C++ items
Duplicate explicit instantiation
Templates can drive up the compilation time a lot. One way around this is to
define the template functions in a source file and explicitly instantiate them
there. This works if you know all the template parameters already, say you want
to template on float
and double
only. But if you do it, instantiate each specialization just once in the whole project. Otherwise you will get errors about duplicate explicit instantiations.
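A sketch of the pattern with a made-up function template:

// norm.hpp
#pragma once

template <typename T>
T norm(T x);

// norm.cpp
#include "norm.hpp"

template <typename T>
T norm(T x) {
    return x < T(0) ? -x : x;
}

// Explicit instantiations, exactly once in the whole project.
template float norm<float>(float);
template double norm<double>(double);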
Needless flushing
A std::endl
is a combined "\n"
and std::flush
. So when one wants to
output a multi-line string, it is better to use "\n"
instead of std::endl
each time. Also, supplying std::endl << std::flush will just make the program a tad slower if output to the terminal is already the limiting factor.
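A small before and after sketch:

#include <iostream>

void print_lines() {
    // Flushes the stream after every single line:
    std::cout << "alpha" << std::endl << "beta" << std::endl;

    // Same output, but without forcing a flush on every line:
    std::cout << "gamma\n" << "delta\n";
}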
Useless cast
Once I saw the line
double x = (double)(1);
This is not really needed. It would be sufficient to write double x = 1; and trust that the initialization does the proper conversion. double x = 1.0 would do the same. One could also just write auto x = 1.0 for double, auto x = 1.0f for float and auto x = 1.0l for long double. That would not even use a single type identifier.
Overloaded functions with different tasks
C++ function overloading is very handy for things like void foo(char const *)
and
void foo(std::string const &)
. You can pass both things and the correct
function will be used.
Sometimes there are overloads which do vastly different things. There is a
void do_stuff(int)
which does something completely different from
void do_stuff(double, size_t)
. How is the user supposed to understand this?
Version control
Not using version control
If you are not using version control, you are probably having a harder time collaborating or developing new features than you need to.
I have seen one group of four people that sent around tar archives and used the diff tool manually. That was in 2015; Git was already ten years old and there is no excuse, really.
Unstable "master"
The master
branch is supposed to be the branch that you can check out and compile, always. If you want to develop a new feature and it is not stable yet, keep it in a development branch or some feature branch.
I expect that a simple git clone
(which checks out master
) will give me
working code.
Outdated "master"
Doing all the work on develop
and never merging back into master
keeps that
branch stable, but it is also not really cool. In one project the master
was
outdated and unstable. Great combination.
Abandoned branches
A lot of git repositories have a lot of branches that seem somewhat
abandoned. The commit messages promise some great feature that the develop
branch lacks. However, the branches have not seen a commit in over a year. So
are they still relevant?
Checking in build files
Way too often build files like .o
files are checked into git. This means that
every time you just compile the code, git will tell you about changes to the
files. Now you can just commit them or leave them uncommitted. Usually they get
committed and then your git log is just polluted with changes to files that
should not have been tracked in the first place.
With git submodules that is really nasty because a simple compilation of the
project will leave all the submodules in the dirty
state. You will have to go
through the submodules recursively and do a git reset --hard
in order to
get rid of the changes. But that resets your compilation again.
The only solution is to do git rm --cached
on those files and put them into a
.gitignore
.
Inconsistent version tags
There are many conventions for creating version tags: a leading v, periods or hyphens or underscores between the digits, leading zeros. But please be consistent, otherwise it will be a nightmare to find the latest version.
Git submodules via GitHub SSH
When repositories are hosted on GitHub, one can either use them as submodules over SSH or HTTPS. The advantage of HTTPS is that it does not need any sort of authentication in order to read. For SSH, you need to have an SSH key registered with GitHub. If you do that on your personal workstation where you have your SSH key configured with GitHub, that is not a problem at all.
On the clusters and supercomputers, I do not have my personal SSH key. And I am not going to copy my personal key to a machine which I do not control. I could create an SSH key on the cluster and add that to my GitHub account. But then that key could be used to make changes to all the repositories on my GitHub account. Not cool.
Creating a whole second GitHub account is possible, but you cannot have more than one free account. So I would have to get a second, paid account for my master thesis work. That is also something I would like to avoid.
The easiest solution is to convert the submodules over to HTTPS. Then you can just download them everywhere. Pushing changes over HTTPS is not that nice because you need to enter your password every single time, so I can understand why SSH is appealing. Also, there are apparently some setups where HTTPS is blocked but SSH is not.
Switching the submodules over to HTTPS would make it easier for people to clone the repository. If you then want to push something, just add an SSH-based remote to the repository and push there. That way, it is usable for both sides.
Unmergable forks
Say one has a project that uses the double
data type everywhere. One wants to
let it run with float
because that is a bit faster and one does not need the
precision. The easy way would be to fork the project, create a new branch and
just replace every double
with float
. Problem solved.
That cannot be sensibly merged, though. The upstream author probably still
wants to have double
in their version of the code. The other person wants
float
. Now the two forks are divergent and there is no sensible way to ever
get them together again. Even worse, further changes cannot be shared without
manually replacing all the double
with float
or the other way around.
It is better to solve this via abstraction. Work with the original author to abstract away the actual floating point data type used. Something like this, perhaps:
#ifdef SINGLE_PRECISION
typedef float Float;
#else
typedef double Float;
#endif
Then you change every double
to Float
in the codebase once and you can
easily share the code back and forth. Actually, there is no need for a long
running fork any more, the two people can just use the same code all the time
and compile it how they like.
One should be careful not to overdo this, though. Exposing an option for everything in your program will make it a nightmare to actually compile, and the number of combinations increases exponentially. In this particular case, where two people have actual needs, it makes more sense to solve it with abstraction than with a hard fork of the whole code.
Dead code
Code might not be needed any more at some point. In order to quickly get rid of it, one can comment it out. In the long run, one should just delete the code and trust the version control to remember it. Otherwise the codebase is littered with old, disabled code that does not serve any purpose any more.
Build System
Hardcoding paths
Some parts of the build system might have hardcoded paths in them. It might work well on the developer's machine but will lead to unresolvable paths on every other machine.
Specific version of GNU Autotools
On one supercomputer, I have GNU Autotools 1.11 and 1.14 available. The project that I have downloaded ships with some files that can easily be generated (see above). Those files explicitly refer to GNU Autotools 1.13 and therefore fail to build for me.
I had to remove files like configure
and aclocal.m4
and recreate those with
some calls to automake --add-missing
and autoreconf
. After that I was able
to run configure
.
Patch the code to build / compile branches
One project has to run on different architectures and with different MPI
implementations and versions of compilers. Since the MPI implementations differ
by some const
in functions like MPI_Send
, you cannot use const
-aware code
with an implementation which is unaware of it. The code had to be patched
before building on different architectures. There were a lot of branches in git
just for compiling.
This makes bisecting impossible, and packaging the software for a distribution is also hard. Debian and especially Fedora frown upon patching the source for reasons like this. The build system should expose all the options needed to build it on the target platforms.