Library Instead of Language

There are many programming languages out there, and I can learn only so many of them. Each language has a different focus, and therefore there are cases where I can completely understand why somebody created a new language.

One example would be Haskell, which is completely different from C. There is no way to make either language solve the problems that the other one does. C is a lightweight abstraction over assembly language and allows one to write portable code that directly addresses the hardware. Haskell takes a rather mathematical approach to programming and does not care about the hardware, but about mathematical purity instead; C does not care about that at all.

There is also no way that those languages could be merged sensibly; it makes sense to have them exist as two different languages.

Matrix-based languages

An instance where I perceive languages as redundant is the family of matrix-based languages: IDL, MATLAB, GNU Octave, R, and Python with SciPy.

Most of them feel redundant to me. In the details they are all different, of course; however, this will later turn out to be in favor of my point.

One has to look at the historical order of the languages in order to understand why this has happened. These are the languages in order of their first release:

Software     First Release    License
C            1972
R            1993 (S: 1976)   GNU GPLv2
IDL          1977             Commercial
MATLAB       1984             Commercial
GNU Octave   1988             GNU GPLv3
Python       1991             Python Software Foundation License
SciPy        2001             BSD-3-Clause

R is ordered by the date of the S language because R is a clone of S, just as GNU Octave is an implementation of the MATLAB language.

So when I now say that R should be a Python library instead of a standalone language, it is a conclusion that one can only draw in hindsight.

The cost of standalone languages

In 2011 I had some contact with IDL, having only known "scalar" languages like C and Java before. So it seemed like a very cool concept to have matrices as first-class data types in the language. Later on I used GNU Octave for my laboratory work, before switching to Python with SciPy. These days I am making the transition to R in order to have another language in my toolbox and also to use existing software in my work group.

Duplicate implementations

While I learn R, I often think that this could have been implemented in Python as well. A lot of the functionality is already supported in the SciPy suite. Even worse, there are R packages that do things which are already implemented in Python. So a lot of effort is spent on re-implementing things in a different language.

The most frustrating thing is seeing something where a Python module exists but nothing for R (or vice versa). In R, I cannot just import a Python module; it has to be implemented again in R. Perhaps the Python implementation is just a wrapper around a C library; in that case one can write an R wrapper as well and be done quickly.

Combining different things

If R were a library for Python, this problem would not arise. The amount of code that could be used together would be much bigger. One could write an application that has a Qt user interface but uses R for statistics in the background. With SciPy I can do exactly that because it is a Python library, as the sketch below shows.
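A rough sketch of that combination, assuming PyQt5 is available; the window class, the button label, and the one-sample t-test on made-up data are purely illustrative:

import sys

import numpy as np
from PyQt5.QtWidgets import (QApplication, QLabel, QPushButton,
                             QVBoxLayout, QWidget)
from scipy import stats


class StatsWindow(QWidget):
    def __init__(self):
        super().__init__()
        self.label = QLabel('Press the button to run a t-test.')
        button = QPushButton('Analyze')
        button.clicked.connect(self.analyze)
        layout = QVBoxLayout(self)
        layout.addWidget(self.label)
        layout.addWidget(button)

    def analyze(self):
        # Made-up sample data; a real application would load it from a file.
        sample = np.random.normal(loc=0.1, scale=1.0, size=100)
        t, p = stats.ttest_1samp(sample, popmean=0.0)
        self.label.setText('t = {:.3f}, p = {:.3f}'.format(t, p))


app = QApplication(sys.argv)
window = StatsWindow()
window.show()
sys.exit(app.exec_())

The statistics run in the same process as the user interface, which is exactly what a standalone R cannot offer.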

A lot of common algorithms have to be implemented twice in order to use them in both ecosystems. This seems like a giant waste to me. Also when I write something, I have to choose which language I use. And then it will only be available to the users of that particular language.

For my laboratory reports I have used Jinja, a template engine primarily built to generate HTML. I generate LaTeX with it and then compile that with pdfLaTeX or LuaLaTeX. In the very same program I use SciPy to do the actual analysis. In this particular case I did not really need it to be a single program: the data is persisted as JSON and then read again to be put into the template. With a JSON library for R I could just as well have written the analysis in R and still used a Python script with Jinja to generate the report.
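A minimal sketch of that workflow, with made-up file names (results.json, report.tex.j2) and custom Jinja delimiters so the template markup does not clash with LaTeX braces:

import json

import jinja2

# Load the analysis results that were persisted as JSON.
with open('results.json') as f:
    results = json.load(f)

# Custom delimiters keep the Jinja syntax out of LaTeX's way.
env = jinja2.Environment(
    loader=jinja2.FileSystemLoader('.'),
    block_start_string='\\BLOCK{', block_end_string='}',
    variable_start_string='\\VAR{', variable_end_string='}',
)
template = env.get_template('report.tex.j2')

# Render the LaTeX source; it can then be compiled with pdfLaTeX or LuaLaTeX.
with open('report.tex', 'w') as f:
    f.write(template.render(**results))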

By defining clear interfaces between processing steps, one can combine programs written in different programming languages. But sometimes it is much easier if one can combine everything right in the same program.

More syntax to learn

One always has to learn the API of the library that one wants to use; there is no way around that. However, when the number of programming languages is limited, one does not have to learn a new syntax each time.

Also, the syntax of Python is flexible enough to do various things that might not be available in other languages. One example is using a container of values as the arguments of a function call. In Python one can do the following:

def f(a, b, c):
    print(a)
    print(b)
    print(c)

args = ['A', 'B', 'C']
f(*args)

This will call the function f as f('A', 'B', 'C') because the asterisk unpacks the list. This is really handy occasionally.

R also has this, but I find it more cumbersome:

f <- function(a, b, c) {
    cat(a, '\n', b, '\n', c, '\n')
}

args <- list('A', 'B', 'C')
do.call(f, args)

Another issue is that Python has classes and objects built in. In R these are available as well, but they feel somewhat bolted on at a later point. I have not actively used them yet because I do not see the point in learning yet another way of object-oriented programming.

Fragmentation of code

In your organization you might have a lot of existing code in R. Some people can perhaps only program in Python and are not willing to learn R just to do their little data analysis, so they use Python instead. Soon you will have different code bases that do data analysis but cannot be used together.

This might serve as a nice cross-check, but all functionality has to be ported explicitly between the two languages. If there were only one language, this problem would not have occurred.

Build system, package management, ...

Every language has at least one build system and at least one package manager. All these things have to be maintained, and learned by the users.

There are also secondary systems like IPython Notebooks for Python and knitr for R. These allow one to create documents which consist of usual markup with cells of code in between. One can run the code and the results are shown inline. This makes writing technical documents very pleasant.

Graphical interfaces have to be written for each language as well. For Python there are various IDEs, and for R there are also a few. All the work put into these has to be replicated for the other ones. RStudio lets me interactively inspect the variables that are defined, which is really nice to get an overview of nested variables. I am not sure whether there is something similar for Python.

Python is flexible enough

I would assume that at the time R (and S, for that matter) was created, there were only C and FORTRAN around. Compared to those, it makes a lot of sense to have a new language. Also, C and FORTRAN are not powerful enough to allow implementing R as a sensible library.

Say we want to compute the values of the sine at 10 sampling points between 0 and 1 and then multiply them by two and add one. In C, you would have to do something like the following:

#include <math.h>
#include <stdlib.h>

/* Creates `steps` evenly spaced values from `begin` to `end`. */
double *make_linspace(double begin, double end, int steps) {
    double *x = malloc(steps * sizeof(*x));

    for (int i = 0; i < steps; ++i) {
        x[i] = begin + (end - begin) * i / (steps - 1);
    }

    return x;
}

int main() {
    int steps = 10;
    double *x = make_linspace(0.0, 1.0, steps);
    double *y = malloc(steps * sizeof(*y));
    double *z = malloc(steps * sizeof(*z));

    for (int i = 0; i < steps; ++i) {
        y[i] = sin(x[i]);
    }

    for (int i = 0; i < steps; ++i) {
        z[i] = 2.0 * y[i] + 1.0;
    }

    free(x);
    free(y);
    free(z);

    return 0;
}

The make_linspace function would be in the library; the main would have to be written by the user. This is not very comfortable. Also, the manual allocation of memory is very error-prone.

I am not sure what R looked like back in the day, but today you can write this as simply as:

x <- seq(0.0, 1.0, length.out = 10)
y <- sin(x)
z <- 2.0 * y + 1.0

But today with Python, you can also do this very nicely:

from pylab import *

x = linspace(0.0, 1.0, 10)
y = sin(x)
z = 2.0 * y + 1.0

The names are different, the syntax is slightly different, but the essence is the same: x and y are vectors, and sin is a function that operates on a whole vector at once. This makes programming much easier than in C.

Python can only do this because x, y and z are objects of type numpy.ndarray. For them, the operators * and + are defined with the magic methods __mul__ and __add__. The NumPy developers have put the loop over the vector into these methods. The user only has to write *, and the for-loop happens invisibly. A language needs classes and operator overloading in order to provide the tools needed to create libraries that are as nice to use as NumPy.
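As a toy illustration of the mechanism (the Vector class below is made up and not how NumPy actually implements it), operator overloading hides the loop like this:

class Vector:
    def __init__(self, values):
        self.values = list(values)

    def __add__(self, scalar):
        # The loop over all elements lives inside the library method.
        return Vector(v + scalar for v in self.values)

    def __mul__(self, scalar):
        return Vector(v * scalar for v in self.values)

    # Makes `2.0 * v` work, not just `v * 2.0`.
    __rmul__ = __mul__

    def __repr__(self):
        return 'Vector({})'.format(self.values)


y = Vector([0.0, 0.5, 1.0])
z = 2.0 * y + 1.0  # calls __rmul__ and __add__; the loops happen invisibly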

For data frames, a central feature of R, there is the Pandas library. I have done a few things with it, but I cannot use it fluently. This is the point where the syntax of R is tailor-made for data frames, while Python does not support them directly.

In R we can group the three vectors into a data frame like this:

df <- data.frame(x, y, z)

This creates the following data structure:

           x         y        z
1  0.0000000 0.0000000 1.000000
2  0.1111111 0.1108826 1.221765
3  0.2222222 0.2203977 1.440795
4  0.3333333 0.3271947 1.654389
5  0.4444444 0.4299564 1.859913
6  0.5555556 0.5274154 2.054831
7  0.6666667 0.6183698 2.236740
8  0.7777778 0.7016979 2.403396
9  0.8888889 0.7763719 2.552744
10 1.0000000 0.8414710 2.682942

Accessing the elements works in various ways. First, one can write df$x to extract the column x; this is equivalent to df[ , 1]. Then one can write df[1, ] to extract the first row. Something like df[1, 2] selects the first row and second column.

With Python and Pandas, one can do similar things. The syntax is just a little more general:

import pandas as pd

df = pd.DataFrame(dict(x=x, y=y, z=z))

At first I found it amazing that R would directly use x, y and z as the column labels of the data frame. In the languages that I knew before, a function like data.frame would only be passed the values of the expressions in the parameters, not their names. However, R can "deparse" the call and use the source code of the expression as a string internally. This is not possible in Python, therefore I have to specify the names again.

The output is almost identical, except that Python starts counting with zero:

          x         y         z
0  0.000000  0.000000  1.000000
1  0.111111  0.110883  1.221765
2  0.222222  0.220398  1.440795
3  0.333333  0.327195  1.654389
4  0.444444  0.429956  1.859913
5  0.555556  0.527415  2.054831
6  0.666667  0.618370  2.236740
7  0.777778  0.701698  2.403396
8  0.888889  0.776372  2.552744
9  1.000000  0.841471  2.682942

In order to get the column x, one has to write df['x'], which is a bit more typing. However, one can also write df.x, which is much nicer to type. In order to get the first row, one has to write df[0:1], which is an inclusive-exclusive interval. In R one can do the same to select multiple rows: df[1:3, ] selects rows 1, 2 and 3 with all the columns.
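Spelled out with the data frame from above; the .iloc access in the last line is an addition of mine for the positional case analogous to R's df[1, 2]:

col = df['x']         # the column x, more typing than R's df$x
col_attr = df.x       # the same column via attribute access
first_row = df[0:1]   # rows [0, 1): just the first row
cell = df.iloc[0, 1]  # first row, second column, like R's df[1, 2]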

It seems that Python as a language is powerful enough to provide the basic data structures of R nicely without adding much special-purpose syntax. The exact differences probably only show up when one is at the very edge of the possibilities. I have heard that Pandas is not as powerful as R regarding data frame management, but I do not have any experience with that myself.

Going forward

The problem in this situation is that both the R and SciPy communities are large and mature. Both have vast amounts of existing code that cannot be rewritten without major cost. Also, the users of each system are likely happy with their environment and do not want it to change.

Over time I would appreciate it if special-purpose languages like R faded out and their functionality became a library in a general-purpose language like Python. This way the functionality would not be lost, and the number of languages one has to learn would become smaller.

General-purpose languages need to provide language features that make writing libraries a pleasant experience. Good languages that I know in this regard are C++ and Python; both are extremely flexible in terms of library building and allow for user-friendly libraries.

There is also a need for languages that make writing highly parallel code easier. Languages like Julia are made for high-performance programming, but they are yet another ecosystem. Technologies usable from C++ like CUDA, OpenMP, or the Kokkos library work with existing code and make a controlled transition possible without rewriting everything.