R Horror, aka "HorroR"

Martin Ueding

2017-12-16

Code & Zahlen

I have been using R for a while now, and there are a few things that seem outright horrifying to me. Don't get me wrong, R is a great tool. I just find the language a bit bewildering.

Dot and dollar

In almost every other language, the dot is a namespace separator. Say you have in C++:

Myclass instance(3);
instance.do_stuff();

Not so in R! There, the . is just what you use instead of the underscore. There are functions like install.packages and update.packages and read.table which in other languages would be called install_packages and so on.

But the S3 object system hooks into these dots. So when you have an object of type myclass and call plot(instance), that function will forward to plot.myclass(instance). So the dot also has some sort of hierarchy.

And what is the dot in other languages is the dollar in R.

It is really confusing at first, but after a while one does not think about it too much any more.
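To make the two roles of the dot and the role of the dollar concrete, here is a minimal sketch. The generic describe, its methods, and the field value are made up for this illustration; only UseMethod, structure, and $ are base R.

```r
# A hypothetical generic; UseMethod() dispatches on the class of `x`.
describe <- function(x, ...) {
    UseMethod("describe")
}

# Fallback that is used when no class-specific method exists.
describe.default <- function(x, ...) {
    "something without a special method"
}

# The part after the dot is the class name: this is the method for "myclass".
describe.myclass <- function(x, ...) {
    # The dollar accesses a member, like the dot does in C++.
    paste("myclass with value", x$value)
}

# An S3 object is just a list with a class attribute.
instance <- structure(list(value = 3), class = "myclass")

describe(instance)  # forwards to describe.myclass(instance)
describe(42)        # forwards to describe.default(42)
```

The dots in install.packages, on the other hand, carry no such meaning; they are just part of the name.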

Abbreviating named function parameters

When you have a function, you can call it with named parameters. Take the following function:

f <- function (parameter = NA, argument = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
}

It has two named arguments, both of which are set to NA by default. Let us call it with just the second one set to some value. We do not need to type out the whole name; we can just type arg:

f(arg = 1)

The output that we get is the following:

parameter: NA
argument: 1

As you can see, the 1 has been passed to the parameter argument although we just typed arg in the function call.

Sounds great? Let me change your mind. Say the author of the function adds a third argument whose name is a prefix of an existing one, like so:

f <- function (parameter = NA, argument = NA, arg = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
    cat('arg:', arg, '\n')
}

The call to the function has not changed: f(arg = 1). But the output has:

parameter: NA
argument: NA
arg: 1

There is no warning from the runtime that you have abbreviated a parameter name. Also the runtime has no way of knowing that you wanted to have the other argument.

There is a little solace, namely when two parameters share a common prefix but neither name is a prefix of the other. An example would be this:

f <- function (parameter = NA, argument = NA, argparse = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
    cat('argparse:', argparse, '\n')
}

When you run f(arg = 1) as before, you finally get an error:

Error in f(arg = 1) : argument 1 matches multiple formal arguments
Execution halted

In my code there could be such time bombs and there seems to be no way of getting any error for doing this.
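One thing that may help at least partially is the option warnPartialMatchArgs from base R, which makes the runtime warn whenever an argument name is matched by abbreviation. The sketch below reuses the two-argument f from above in a shortened form and catches the warning so that its message can be inspected. The variable names caught and old_options are just for this illustration.

```r
f <- function(parameter = NA, argument = NA) {
    argument
}

# Ask R to warn on abbreviated argument names; remember the old setting.
old_options <- options(warnPartialMatchArgs = TRUE)

caught <- NULL
value <- withCallingHandlers(
    f(arg = 1),
    warning = function(w) {
        # Store the message and continue with the normal evaluation.
        caught <<- conditionMessage(w)
        invokeRestart("muffleWarning")
    }
)

# Restore the previous setting.
options(old_options)

print(value)
print(caught)
```

Combined with options(warn = 2), which turns warnings into errors, this should make such calls fail loudly. It does not help in the case where a new argument with exactly the abbreviated name gets added, because that is then an exact match.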

To make things worse, this also plays badly with the ... syntax. Say you have a function which is just a wrapper around another function and you want to allow the user to pass extra arguments. Say we want to wrap the function plot(x, y, col = 'black') such that we can directly plot a data frame with columns named x and y. We would write this as such:

plot_wrap <- function (df, ...) {
    plot(df$x, df$y, ...)
}

So far, this is not a problem. However, we could also have done it more generally in this way:

plot_wrap <- function (df, columns = c('x', 'y'), ...) {
    plot(df[[columns[1]]], df[[columns[2]]], ...)
}

This way our function is more general and supports other column names as well. Now suppose I have such a data frame and want to plot it in red. I write plot_wrap(df, col = 'red') in the hope that the color argument gets passed down to plot. But R will interpret the col = as a shorthand for columns = and then there will be really strange error messages.
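A possible way around this particular trap is to put the ... in front of the optional parameter, because R only matches arguments that come after ... by their exact name. The following is a sketch under that assumption. In order to run without a graphics device it uses a made-up stand-in fake_plot instead of plot, and it uses [[ to extract the columns as vectors.

```r
# Stand-in for plot() that just returns what it was given.
fake_plot <- function(x, y, col = "black") {
    list(x = x, y = y, col = col)
}

# `columns` comes after `...`, so it is only matched by its full name.
plot_wrap <- function(df, ..., columns = c("x", "y")) {
    fake_plot(df[[columns[1]]], df[[columns[2]]], ...)
}

df <- data.frame(x = 1:3, y = 4:6)

# `col` is no longer mistaken for `columns` and reaches fake_plot().
result <- plot_wrap(df, col = "red")
print(result$col)
```

The price is that columns can now only be passed by its full name, never by position.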

Assignment operator

R has five assignment operators:

<-
<<-
->
->>
=

The first one is a normal assignment, x <- 3. The second one does not create a local variable; it looks for an existing variable of that name in the enclosing scopes and assigns to that one, and if it finds none, it creates the variable in the global environment. The third and fourth one are just mirrored versions, such that you can write 3 -> x. At some point in the past, this must have seemed like a really great idea.
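The difference between <- and <<- is easiest to see with a closure. This is a small sketch; make_counter and count are names made up for the example.

```r
make_counter <- function() {
    count <- 0  # local to this call of make_counter()
    function() {
        # `<<-` modifies `count` in the enclosing function scope.
        # With `<-` a new local `count` would be created on every call.
        count <<- count + 1
        count
    }
}

counter <- make_counter()
first <- counter()
second <- counter()
print(c(first, second))
```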

Since this is confusing for people coming from other programming languages where = is the assignment operator (C, C++, Python, Haskell, PHP, Java, JavaScript), modern versions of R also have this = assignment operator. Now there are ideological wars on whether <- or = is the correct one to use.

The thing is that <- is a token consisting of two characters, each of which is a valid token by itself! So we can write both x <- 5 and also x < -5. The first assigns 5 to x ($x := 5$), the second does the comparison $x < -5$. This in itself is not such a big deal, both cases are easy to read for a human.

But what happens if you have code by a person who does not put spaces around operators? Their code will feature x<-5. What does it do? It will be parsed as x <- 5. But can you be certain that the author did not mean x < -5? In C++ you can just write x<-5 and it will mean x < -5 because there is no <- operator.

In practice this might not be an actual problem. The values of the expressions are different. But perhaps the code with if (x<-5) does the apparently right thing today but not tomorrow.
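One can ask the parser directly what it makes of the two spellings. The sketch below uses quote to get the unevaluated expressions and looks at the operator that ended up at the head of each call. The variable names are just for illustration.

```r
assignment <- quote(x<-5)
comparison <- quote(x < -5)

# The first element of a call is the function, here the operator.
op_assignment <- as.character(assignment[[1]])
op_comparison <- as.character(comparison[[1]])
print(c(op_assignment, op_comparison))

# Inside an `if` the version without spaces silently overwrites x.
x <- 0
branch <- if (x<-5) "taken" else "not taken"
print(branch)
print(x)
```

With the comparison 0 < -5 the branch would not have been taken; with the assignment the condition is the assigned value 5, which counts as true.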

Puns everywhere

R is riddled with puns. A lot of package names are some word with an "R" put into it. There is "knitr", "tidyr", "stringr". The "knitr" package can generate beautiful reports from R Markdown documents. These reports are documents with headings and text interleaved with R code. This is my preferred format for things like experimental reports where you want to document something and show off your data.

The functions in this package are called knit and purl. I knew the first word, the activity of converting wool into garments. But "purl" I had to look up; it seems to be similar but in the opposite direction. One could argue that rmd_to_md and rmd_to_r would be better descriptions, but it would also be bourgeois and not funny. The problem with the punny names is that you cannot search for them, nor can you remember or guess them easily.

But I won't complain about the "tidyverse" packages because they have fundamentally changed the way I work with data and "ggplot2" is hands down the best plotting system I have ever tried.

Missing arguments

In languages like Python and C++ there is a concept of optional arguments. In R, this also exists, but with a slight twist.

Define a function like this in Python:

def py_func(a, b):
    print(a)

The argument b is not used within the function, but bear with me.

If you try to call this as py_func(1), it will fail loud and clear:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: py_func() missing 1 required positional argument: 'b'

What you could do is to define the function with a default value for b, like this:

def py_func(a, b=2):
    print(a)

Now calling it as py_func(1) works just fine, because this gets called as py_func(1, 2).

In C++ it is exactly the same way, you need to specify all the arguments that have no default value.

In R, this is not the case. When you declare a function analogously to the first example, it would look like this:

r_func <- function(a, b) {
    print(a)
}

Now calling r_func(1) will work! It will print out the 1 and does not complain. Only when we call r_func() without any arguments will it complain that the parameter a was not passed a value and that there is no default value.

This means that when you write a function, you cannot be certain that the caller passed something for each parameter. Only when you do some computation with one of the missing parameters will it crash. Sometimes you even want to make a parameter optional without it having a default value. In languages like Python or C++ you would have to use some neutral sentinel value like None (Python) or nullptr (C++). In R, however, there is a difference between a parameter not being passed at all and it explicitly being passed the value NULL.

In R there is the function missing which will check whether the parameter was missing in the call. Therefore you see in R code things like this:

r_func <- function(a, b) {
    if (missing(a)) {
        stop("Parameter `a` is missing and needs to be passed!")
    }
    if (missing(b)) {
        stop("Parameter `b` is missing and needs to be passed!")
    }
}

There is no way to know from the function signature which parameters are mandatory and which are optional. You need to look into the documentation or even the function definition to figure that out.
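The difference between "not passed" and "passed as NULL" can be made visible with a short sketch. The function report is made up for this purpose; missing and is.null are base R.

```r
report <- function(a, b) {
    if (missing(b)) {
        "b was not passed"
    } else if (is.null(b)) {
        "b was passed as NULL"
    } else {
        paste("b is", b)
    }
}

not_passed <- report(1)
passed_null <- report(1, NULL)
passed_value <- report(1, 2)
print(c(not_passed, passed_null, passed_value))
```

None of the three calls fails, and nothing in the signature function(a, b) hints at b being optional.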

Data frame access operator

The data.frame class has two access operators: [ and [[. The first selects a subset of a data frame, the second a single element only. The row and column arguments to the first variant can be vectors with length greater than one.

You can specify zero, one, or two arguments. R also allows you to leave arguments out, explicitly or implicitly. Therefore we can write stuff like df[, ], which has two missing arguments. The method [.data.frame does all sorts of stuff depending on which arguments you give it.

Let's agree on the following data frame:

df <- data.frame(a = 10:14, b = 15:19, c = 20:24)

The variable df then contains this:

   a  b  c
1 10 15 20
2 11 16 21
3 12 17 22
4 13 18 23
5 14 19 24

However, we can also obtain this using df[] and df[, ]. These give us the whole data frame.

In order to select two columns from it, we can use df[c(1, 2)] and obtain a new data frame with just these columns:

   a  b
1 10 15
2 11 16
3 12 17
4 13 18
5 14 19

We can also write this as df[, c(1, 2)] to mean "all rows, columns 1 and 2". It gives the same result.

In order to get two rows, we do df[c(3, 4), ] and obtain another data frame:

   a  b  c
3 12 17 22
4 13 18 23

But what happens when we only happen to select a single column this way? When we use df[1], we obtain a data frame:

   a
1 10
2 11
3 12
4 13
5 14

But the seemingly equivalent df[, 1] gives us a vector:

[1] 10 11 12 13 14

You can also get the vector using df$a with the named column.

We can force it to give us a data frame with another argument, df[, 1, drop = FALSE]. Then we get the same result as df[1].

Quiz

Here is a little quiz: What are the return types of the following expressions?

What is the type of df?
What is the type of df[]?
What is the type of df[, ]?
What is the type of df[1, ]?
What is the type of df[1]?
What is the type of df[, 1, drop = FALSE]?
What is the type of df[c(2, 3), c(1, 2)]?

It is data.frame for every one. But for the following, it is a vector!

What is the type of df[, 1]?
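The answers can be checked directly. This sketch uses the same data frame as above and only base R functions.

```r
df <- data.frame(a = 10:14, b = 15:19, c = 20:24)

# These all stay data frames.
is.data.frame(df[1])
is.data.frame(df[1, ])
is.data.frame(df[, 1, drop = FALSE])
is.data.frame(df[c(2, 3), c(1, 2)])

# This one silently drops down to a plain integer vector.
is.data.frame(df[, 1])
is.integer(df[, 1])

# It is the same vector that `[[` and `$` return.
identical(df[, 1], df[[1]])
identical(df[, 1], df$a)
```

In code that selects columns programmatically it therefore seems safest to always pass drop = FALSE when a data frame is expected.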

Logical AND

Like in C and C++, there are the & and && operators. In R, they are somewhat similar but different. The & does a normal element-wise AND operation. The && only takes the first element of each operand. I do not understand why it makes sense to define the operation c(T, F) && c(T, T) to be true, but that seems to be just another quirk in the list. Why would one have boolean vectors and only work with the first elements?
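Here is a short sketch of how the two operators are probably meant to be used: & for vectors, && for single values in conditions, where it also short-circuits. As far as I know, newer versions of R (4.3.0 and later) turn a call like c(T, F) && c(T, T) into an error instead of quietly using the first elements, so the sketch only passes single values to &&.

```r
# Element-wise AND on whole vectors.
elementwise <- c(TRUE, FALSE) & c(TRUE, TRUE)
print(elementwise)

# Collapse a logical vector into a single value explicitly.
all_true <- all(elementwise)
any_true <- any(elementwise)

# `&&` works on single values and stops at the first FALSE,
# so the right-hand side is never evaluated here.
short_circuit <- FALSE && stop("this is never reached")
print(short_circuit)
```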

Index vector lengths

When you have an array with the numbers from 1 to 10 (inclusive), it has length 10. When you try to index that with a boolean index vector of length 3 in Python NumPy, it will fail:

>>> import numpy as np
>>> np.arange(1, 11)[np.arange(1, 4) < 4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: boolean index did not match indexed array along dimension 0; dimension is 10 but corresponding boolean dimension is 3

When you do the same thing in R, you get no warning whatsoever:

> (1:10)[c(TRUE, TRUE, FALSE)]
[1]  1  2  4  5  7  8 10

It seems that a logical index vector with insufficient length is just recycled, that is, repeated until it is as long as the vector (here TRUE, TRUE, FALSE, TRUE, TRUE, FALSE and so on). You can also specify an index vector which is longer than the vector itself; then you get a bunch of NA values:

> (1:10)[1:30 > 3]
[1]  4  5  6  7  8  9 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA

So when you by mistake use an index vector meant for rows for the columns, you do not get any error that would help finding that bug.
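The recycling can be spelled out with rep, and if one wants the NumPy behavior, a small guard function does the job. The helper strict_subset is hypothetical and not part of R; it is just one way to sketch such a check.

```r
x <- 1:10
mask <- c(TRUE, TRUE, FALSE)

# What R does implicitly: repeat the mask until it covers all of x.
recycled <- rep(mask, length.out = length(x))
same_result <- identical(x[mask], x[recycled])
print(same_result)

# Hypothetical helper that refuses masks of the wrong length.
strict_subset <- function(x, mask) {
    stopifnot(is.logical(mask), length(mask) == length(x))
    x[mask]
}

# Fails with an error instead of recycling.
attempt <- try(strict_subset(x, mask), silent = TRUE)
print(inherits(attempt, "try-error"))

# Works when the lengths agree.
strict_result <- strict_subset(x, x > 7)
print(strict_result)
```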

Unusual function names and argument orders

In most languages that are somewhat functional there is a map command, which takes a function and a container and maps that to a new container. Mathematically we could write

$$\text{map} \colon (X \to Y) \times \{ X \} \to \{ Y \} \,.$$

In Haskell the signature of the function indeed is this:

map :: (a -> b) -> [a] -> [b]

In the Wolfram Language their Map also works the same way: function as the first argument, container as the second argument. Also in Python the map takes the function and then the iterable. Okay, C++ has std::transform, which takes a pair of input iterators, an output iterator and then the transformation. But that is so weird to use that it does not really count in this comparison.

And then you have R, which does not call it "map" but lapply. And to make it worse, the argument order is different, so it takes the list first and the function second. This is so irritating when I work on my project consisting of both Wolfram Language and R code that I routinely stumble over this. The mapply function takes the function as the first argument, and the apply function takes the function as the last argument. You can't make this stuff up, except maybe if you are used to PHP.
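The three argument orders side by side, as a small sketch with made-up inputs. Base R also ships a function called Map, which does take the function first, so it may be the least surprising choice when coming from other languages.

```r
# lapply: data first, function second; returns a list.
squares <- lapply(1:3, function(x) x^2)

# mapply: function first, then one or more vectors.
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# apply: matrix first, then the margin (1 = rows), function last.
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)

# Map: function first, like in Haskell or Python; returns a list.
squares_again <- Map(function(x) x^2, 1:3)

print(unlist(squares))
print(sums)
print(row_sums)
```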

Inclusive range

In Python and C++ the ranges to access sequences are inclusive at the beginning and exclusive at the end. Also indices start from zero, but that is not so important. If you take seq[4:6] in Python, you get the elements with indices 4 and 5, but not 6. In R, seq[4:6] also gives you the element with index 6. Python's range function will also not give you the last element, so this matches nicely.

I needed to traverse the upper triangle of a matrix. So I did this in R:

for (row in 1:size) {
    for (col in (row + 1):size) {
        i1 <- (row - 1) * size + col
        i2 <- (col - 1) * size + row
        # …
    }
}

The size is the number of rows and columns in the square matrix. If size is 3, then 1:size is 1:3 and that is the sequence 1, 2, 3. But when row has the last value, 3, this gives a problem in the next line. We have (row + 1):size, which evaluates to 4:3. In Python, range(4, 3) would just give an empty sequence. But in R, 4:3 gives the descending sequence 4, 3! That means that we access elements that should not be there in the first place.

We therefore need to fix the outer loop such that we stop one row earlier.

for (row in 1:(size - 1)) {
    for (col in (row + 1):size) {
        i1 <- (row - 1) * size + col
        i2 <- (col - 1) * size + row
        # …
    }
}

This fixes that issue, but we now have a problem if size is just 1. Then row will be in the sequence 1, 0, which will get incorrect indices as well. Therefore one needs another fix here.

if (size > 1) {
    for (row in 1:(size - 1)) {
        for (col in (row + 1):size) {
            i1 <- (row - 1) * size + col
            i2 <- (col - 1) * size + row
            # …
        }
    }
}

So although the inclusive indexing in R is supposed to make it easier for people to think about ranges, it leads to more cumbersome code than it would in C++ or Python.
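A less cumbersome variant might be to avoid the colon operator for loops altogether and use seq_len from base R, which gives an empty sequence for a length of zero. The following is a sketch under the assumption that size is at least 1, since seq_len rejects negative lengths. The function upper_triangle_pairs is made up here and just collects the (row, col) pairs instead of computing indices.

```r
# The colon operator counts downwards when the start exceeds the end.
descending <- 4:3
print(descending)

upper_triangle_pairs <- function(size) {
    pairs <- list()
    # seq_len(0) is empty, so nothing happens for size == 1.
    for (row in seq_len(size - 1)) {
        # Columns row + 1, ..., size; empty when row == size.
        for (col in row + seq_len(size - row)) {
            pairs[[length(pairs) + 1]] <- c(row, col)
        }
    }
    pairs
}

pairs_3 <- upper_triangle_pairs(3)
pairs_1 <- upper_triangle_pairs(1)
print(length(pairs_3))
print(length(pairs_1))
```

For size equal to 3 this visits (1, 2), (1, 3) and (2, 3), and for size equal to 1 it visits nothing, without any extra if around the loops.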