Programming Toolbox

For my work, I use various tools. For every task, some tools are better suited than others. However, a mostly suited tool that you know is better than the perfect tool that you don't know. Therefore I have found it helpful to know at least one representative from each category well. If you know one tool from a category, you will have a relatively easy time picking up the others when needed.

The following is a list of the categories and tools that I know of; for some tools I go into more detail. The purpose of this article is to make you aware of tools that could turn out to be helpful in your work.

Native Languages

Native programming languages compile to machine code and give you access to the full performance of the system. When working with supercomputers, we must use the granted resources efficiently. Some examples are:

  • C
  • C++
  • FORTRAN
  • Objective-C
  • Rust

I know C++ and C, though I have never programmed anything in FORTRAN. So far this has not been a problem, and from what I have seen, I'd probably want it to stay this way.

Managed Languages

A managed language has a garbage collector and in general a runtime that manages the execution of your program beyond the operating system kernel. Examples would be:

  • C#
  • Java
  • Kotlin
  • Scala
  • Swift

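These languages have the advantage that one does not have to free memory by hand, so one rarely has to worry about memory leaks (like one has to in C and in some cases even in C++). Developer productivity is usually higher than with the native languages. Swift is a borderline entry in this list: it compiles to native code and manages memory with automatic reference counting instead of a tracing garbage collector.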

The price of the abstraction is that one does not have full access to the hardware any more. In compute-intensive programs this will make a difference. Also when resources are scarce (mobile, embedded), these languages would not be my first choice. For everything else, the performance difference due to the runtime will easily be compensated by algorithmic improvements.

From this category I only have some experience with Java, and that dates back to 2012.

Script Languages

For some tasks, where development is even more ad hoc and resources are not limited, script languages are a good choice. In contrast to the native and managed languages, there is no separate compilation step. One can just run the code through the interpreter and iterate quickly. The interpreters usually offer an interactive mode as well. Common ones are:

  • Perl
  • Python
  • Ruby
  • Tcl

I picked up Python in 2011 when I needed something to process data with. Java was too inflexible to massage text files, and C was even worse. So I started to work through the tutorial and quickly came to like it. More on data analysis languages can be found in the section about matrix-based languages.

Shell Languages

The shell is the program that runs in the terminal and takes your commands. Working this way is fairly common on Linux; on Windows and macOS comparatively few people use the shell. Examples are:

  • Bash
  • Fish Shell
  • PowerShell
  • zsh

The power of the shell is that it can tie together a process that involves various other programs, potentially written in any programming language. The common interface consists of command line parameters, exit status, and standard input and output.
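
To make that interface concrete, here is a minimal sketch in Python rather than in a shell script. It starts a child process with a command line parameter, feeds it standard input, and reads back its standard output and exit status. The child is simply the Python interpreter itself, so that the example does not depend on any other program; the helper name `run_child` and the tiny child program are made up for this illustration.

```python
import subprocess
import sys

# Tiny child program: upper-cases its standard input and exits with the
# status that was passed as its first command line parameter.
CHILD_CODE = (
    "import sys; "
    "sys.stdout.write(sys.stdin.read().upper()); "
    "sys.exit(int(sys.argv[1]))"
)


def run_child(text, status):
    """Run the child program; return (exit status, standard output)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE, str(status)],
        input=text,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    print(run_child("hello shell\n", 0))
    print(run_child("failed step\n", 3))
```

A shell does essentially the same thing for every command it runs, which is why programs written in different languages can be combined so freely.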

Bash

Say you have a folder of many (so many that converting them manually would take you days) pictures that you want to resize to 1000 by 1000 pixels to send someone via email. Sure, you can find some program that lets you batch process your pictures. But what happens if you want to rename them at the same time, putting the number of pixels into it?

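In Bash, you could do this:

for pic in *.jpg
do
    echo "Converting $pic."
    # Fit the picture into 1000x1000 pixels; the aspect ratio is preserved.
    # ${pic%.jpg} strips the extension, so a.jpg becomes a-1000.jpg.
    convert "$pic" -resize 1000x1000 "${pic%.jpg}-1000.jpg"
done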

For easy tasks, Bash scripts can be a good tool. For anything slightly larger, I would definitely recommend using a fully featured programming language like Python. See my dislike for Bash.
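
For comparison, the same loop could look roughly like this in Python. This is only a sketch: it assumes that ImageMagick's `convert` is on the PATH, and the function names `resized_name` and `resize_all` are invented here.

```python
import subprocess
from pathlib import Path


def resized_name(pic, size=1000):
    """Return the target path, e.g. holiday.jpg -> holiday-1000.jpg."""
    return pic.with_name(f"{pic.stem}-{size}{pic.suffix}")


def resize_all(folder, size=1000):
    """Resize every .jpg in folder so that it fits into size x size pixels."""
    for pic in sorted(Path(folder).glob("*.jpg")):
        target = resized_name(pic, size)
        print(f"Converting {pic.name}.")
        # check=True raises an exception if convert reports an error.
        subprocess.run(
            ["convert", str(pic), "-resize", f"{size}x{size}", str(target)],
            check=True,
        )
```

Calling `resize_all(".")` would process the current directory. The script is longer than the Bash version, but the renaming logic lives in its own function and can be tested without touching any picture.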

Fish Shell

The Fish shell is a lot saner than Bash. It also supports handy completion right out of the box. zsh can do all those things as well, but you need to configure them first. For scripts I still use Bash because Fish is installed almost nowhere by default.

Matrix-based Languages

For data analysis, there are GUI tools like Origin and SPSS, which I have never used. I prefer to write code and be able to reproduce a particular analysis. Therefore I gravitate towards the ones which are built around a programming language. Of this kind, I know of the following:

  • IDL
  • Mathematica
  • Matlab
  • Octave
  • Python + NumPy + SciPy + Matplotlib + Pandas
  • R

Python

Python with the appropriate libraries is a good tool for data analysis. In various areas of physics it is the dominant language. The fact that Python is a general-purpose programming language is an advantage and a disadvantage at the same time. You can combine all sorts of Python libraries in your program and you are not limited to numerical and statistical operations. However, the syntax is a bit more verbose than in languages designed specifically for numerics, such as Matlab or R.

With IPython/Jupyter notebooks, Python can be used through a nice interactive notebook interface similar to the one Mathematica has.

R

I started to use R in the past few months because the people in my workgroup who do not use Python use R. In order to use their code as well, I started to look into it. The syntax is a bit cumbersome at first, but on occasion it is more concise than Python. Other aspects are solved less elegantly, so this is not a clear win. The data frame structure that Python only obtains with Pandas is native to R; one could say that Python + Pandas is a bit like R, just with a more verbose syntax.

The open source IDE RStudio allows one to explore the currently defined variables. One can also change the code in the editor, source the file and then play around in the interactive session.

Functional Languages

Functional languages take a rather different approach from the procedural languages listed so far: programs are built by composing functions, and mutable state is avoided where possible. The ones that I know of are:

  • Clojure
  • Haskell
  • Lisp
  • Racket

I have looked into Haskell and it has changed the way that I think about programs a little bit. Also some of the constructs that are proposed for newer C++ standards seem to originate in the functional languages.
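
Python is not a functional language, but it can imitate the style well enough to give a rough impression of the difference. The sketch below computes the sum of the squares of all even numbers twice: once with a loop that updates a variable, and once as a chain of functions without any reassignment. The function names are made up for this illustration.

```python
from functools import reduce


def sum_even_squares_procedural(numbers):
    """Procedural style: loop and update an accumulator."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n * n
    return total


def sum_even_squares_functional(numbers):
    """Functional style: filter, map and fold, without reassignment."""
    evens = filter(lambda n: n % 2 == 0, numbers)
    squares = map(lambda n: n * n, evens)
    return reduce(lambda acc, n: acc + n, squares, 0)


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6]
    print(sum_even_squares_procedural(data))
    print(sum_even_squares_functional(data))
```

Both calls print 56. In Haskell the second version shrinks to a one-liner such as `sum (map (^2) (filter even xs))`.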

Web Development

I partly got into programming via web development. On the client (browser) side, you will need to know the following:

  • HTML
  • CSS
  • JavaScript

JavaScript is optional if you can do without direct interactivity. On the server side, you need one of these:

  • C# + ASP.NET
  • Java
  • JavaScript + Node.js
  • PHP
  • Python + {Django, Flask}
  • Ruby + Rails

I have some experience with PHP, and it is not really my favorite language. Then I worked a bit with Python and web server libraries, but the hosting situation is somewhat disappointing: most shared hosters only support PHP, and a virtual server costs more than shared hosting. Eventually I just started to make static websites, like this one. This is great for technical documents where there is no user interaction or database involved. There is a huge number of static site generators, but there are lists that give you an overview. For this website, I use Sphinx.

Version Control

Version control (or source code management) is a tool that takes snapshots of a directory, lets you go back to previous versions, and shows the differences between them. So instead of having multiple copies of files or even whole directories lying around, these tools let me work in a single directory and keep track of the changes. Popular ones are:
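
The "show the differences" part can be illustrated with Python's standard `difflib` module. This is only a sketch of the idea and says nothing about how Git or the other tools store their data; the two file versions and the labels are made up.

```python
import difflib

# Two hypothetical snapshots of the same file.
OLD = ["Programming Toolbox\n", "I use various tools.\n"]
NEW = ["Programming Toolbox\n", "I use various tools.\n", "Some are listed below.\n"]


def show_changes(old, new):
    """Return a unified diff between two versions as a single string."""
    diff = difflib.unified_diff(old, new, fromfile="version 1", tofile="version 2")
    return "".join(diff)


if __name__ == "__main__":
    print(show_changes(OLD, NEW))
```

Added lines are marked with a leading `+` and removed lines with a leading `-`, which is the same convention the diff output of the tools listed here follows.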

  • Bazaar
  • git
  • Mercurial
  • Perforce
  • SVN

I started with Bazaar and eventually switched to git.

Git

Git has a steep learning curve, but in the long run it turns out to be a clear-cut tool without an "easy interface" layer that would hide most of the program's potential.

Markup Language

There are many ways to represent formatting in text. Some are big formats that are used in GUI word processors, like Rich Text Format (RTF). The following are markup languages that can be written with a text editor and also read by humans.

  • AsciiDoc
  • BBCode
  • DocBook
  • DokuWiki
  • HTML
  • LaTeX
  • Markdown
  • MediaWiki
  • Org-mode
  • reStructuredText
  • roff / groff / man
  • Textile / Redcloth

The languages differ in the "weight" of their markup: some are easy to read (Markdown), others are rather hard to read in their bare form (HTML). If you need to convert between these languages, have a look at Pandoc.

Markdown

Markdown is now used in a lot of places. It is very lightweight and easy to learn. The original version lacks a few features (tables, source code highlighting, definition lists) and was extended by various parties in inconsistent ways. The GitHub variant (GitHub Flavored Markdown) seems to be somewhat popular.

Usages include:

  • GitHub issues
  • Stack Overflow postings
  • Documentation in Doxygen
  • Ghost blog posts

reStructuredText

Just like Markdown, reStructuredText is a lightweight markup language for plain text. It has more features than Markdown and is a little more complex, but it has wider support and more conversion tools. You can easily make UNIX man pages, S5 HTML slide shows, LaTeX documents, and even OpenDocument files (for LibreOffice and OpenOffice.org) out of it with rst2man, rst2s5, rst2latex, and rst2odt.

These pages are written in reStructuredText and then converted into HTML. For every page, there should be a .rst version around as well. For this page, check out the source by clicking on the "Show reST" link in the navigation bar.

reStructuredText has a clearly defined extension syntax, therefore one can just add new directives as needed.

Text Editor

A good text editor is very helpful when one programs in different languages. I usually do not use IDEs and just program with Vim. This is helpful because I have the same program for various types of files. If I were to use one IDE per language, my workflow would quickly become very inconsistent.

Documentation Generator

For most programming languages, there is some tool that can extract special comments from the code and generate browsable HTML documentation from it. This can be very nice for users of a library that you write.

Tool      Language
Doxygen   C, C++
Epydoc    Python
Sphinx    Python, C++, …
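
In Python the "special comments" are docstrings. The sketch below shows a documented function and reads its documentation back with the standard `inspect` module, which is roughly the raw material that a generator such as Sphinx works from. The function `kinetic_energy` is a made-up example, and the `:param:` fields follow the reStructuredText convention that Sphinx understands.

```python
import inspect


def kinetic_energy(mass, velocity):
    """Return the kinetic energy of a point mass.

    :param mass: Mass in kilograms.
    :param velocity: Speed in metres per second.
    :returns: Energy in joules.
    """
    return 0.5 * mass * velocity ** 2


if __name__ == "__main__":
    # getdoc() returns the docstring with the indentation cleaned up.
    print(inspect.getdoc(kinetic_energy))
```

Because the documentation sits right next to the code, it is more likely to be updated when the code changes.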

Data Format

Configuration files for programs can be designed from scratch. Unfortunately this happens often: people start with a plain text file and parse it themselves in their program. Later on the format needs to be extended, and then hacks are built on top of kludges. It is better to use a standard format:

  • CSV
  • HDF5
  • INI
  • JSON
  • XML
  • YAML

The advantage with these formats is that you have extensive tooling:

  • Syntax highlighting and indentation support in text editors
  • Parsing libraries in various programming languages
  • These libraries can also write out the data in the same format, which allows you to generate configuration files.
  • Helper programs like xmldiff or h5diff can be used directly without having to write something for the self-designed data format.

With the exception of CSV and INI, the syntax is flexible enough to allow for virtually arbitrary data structures. Also they are nestable, so you can include the complete input data in your output as part of the data.
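
As a small illustration of both points, parsing plus writing and nesting, here is a sketch using Python's standard `json` module. The configuration keys (`grid`, `steps`) and the result value are invented for the example.

```python
import json

# Hypothetical configuration file content.
CONFIG_TEXT = '{"grid": {"size": [8, 8, 8, 16]}, "steps": 100}'


def make_output(config_text, result):
    """Parse the configuration and embed it in the output document."""
    config = json.loads(config_text)
    output = {"input": config, "result": result}
    return json.dumps(output, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(make_output(CONFIG_TEXT, {"energy": 0.58}))
```

Since the complete input travels with the result, one can later tell exactly which parameters produced a given output file.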

Here is some information about the formats, sorted from simple to complex.

  • CSV can store rectangular table data. All values are strings; there are no data types. It is well suited for tabular data or matrices and can be used with a lot of programs.

  • INI is great for very simple configurations. It has a one-level hierarchy (sections that contain key-value pairs) and can only represent strings. It is easy for humans to read.

  • JSON allows an arbitrary amount of nesting. It knows strings, numbers, booleans and null, as well as lists and maps. It is great for data exchange; it is readable but not very pretty.

  • YAML can do everything that JSON can, but its syntax is much more readable. It also has a few extras, like references to objects defined earlier in the file (anchors and aliases), that make it more powerful.

  • XML can do even more, but for configuration files it might be overkill. The advantage is that extensive tooling already exists and that it is a proven technology. It is very verbose and unpleasant to write by hand. Data is stored rather inefficiently, so XML is often compressed using gzip.

  • HDF5 is a hierarchical data format with a focus on high performance. I have not used it myself, though I hear that its hierarchical structure is similar to that of JSON. Data is stored in binary form, and there are implementations with MPI support so that one can use it with parallel file systems.
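
The difference in data types between the simpler formats shows up as soon as one parses them. The following sketch uses only Python's standard library; the sample data and the function name `parse_all` are made up.

```python
import configparser
import csv
import io
import json

CSV_TEXT = "x,y\n1,2.5\n"
INI_TEXT = "[run]\nsteps = 100\n"
JSON_TEXT = '{"run": {"steps": 100, "tolerance": 1e-6, "tags": ["a", "b"]}}'


def parse_all():
    """Parse the three samples and return the values of interest."""
    rows = list(csv.reader(io.StringIO(CSV_TEXT)))
    ini = configparser.ConfigParser()
    ini.read_string(INI_TEXT)
    data = json.loads(JSON_TEXT)
    return rows[1], ini["run"]["steps"], data["run"]


if __name__ == "__main__":
    csv_row, ini_steps, json_run = parse_all()
    print(csv_row)    # CSV cells arrive as strings
    print(ini_steps)  # INI values are strings too
    print(json_run)   # JSON keeps numbers and lists
```

With CSV and INI the program has to convert the strings itself, for example with `int()` or `float()`, whereas the JSON parser already hands back numbers and lists.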