R Horror, aka “HorroR”

I have been using R for a while now, and there are a few things that seem outright horrifying to me. Don’t get me wrong, R is a great tool. I just find the language a bit bewildering.

Dot and dollar

In almost every other language, the dot separates an object or namespace from its members. Say you have in C++:

Myclass instance(3);
instance.do_stuff();

Not so in R! There, the . is just what you use instead of the underscore. There are functions like install.packages and update.packages and read.table which in other languages would be called install_packages and so on.

But the S3 object system hooks into these dots. So when you have an object of type myclass and call plot(instance), that function will forward to plot.myclass(instance). So the dot also has some sort of hierarchy.

And what is the dot in other languages is the dollar in R.
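A minimal sketch of both roles of the dot and of the dollar. The generic describe and the class myclass are made-up names for illustration only; the mechanism (UseMethod looking for a function named generic.class) is the S3 dispatch described above.

```r
# Hypothetical generic and class, used only for illustration.
describe <- function(x, ...) UseMethod("describe")
describe.myclass <- function(x, ...) "called describe.myclass"
describe.default <- function(x, ...) "called describe.default"

# An S3 object is just a value with a class attribute.
instance <- structure(list(value = 3), class = "myclass")

result_myclass <- describe(instance)  # dispatches to describe.myclass
result_default <- describe(42)        # no describe.numeric, so the default is used
member <- instance$value              # `$` does what the dot does in C++

print(result_myclass)
print(result_default)
print(member)
```

So the dot is an ordinary name character until a generic function happens to look for a matching suffix.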

This is really confusing at first, but after a while one does not think about it too much any more.

Abbreviating named function parameters

When you have a function, you can call it with named parameters. Take the following function:

f <- function (parameter = NA, argument = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
}

It has two named arguments, both set to NA by default. Let us call it with just the second one set to some value. We do not need to type out the whole name; we can just type arg:

f(arg = 1)

The output that we get is the following:

parameter: NA
argument: 1

As you can see, the 1 has been passed to the parameter named argument although we just typed arg in the function call.

Sounds great? Let me change your mind. Say the author of the function adds a third parameter whose name is a prefix of an existing one, like so:

f <- function (parameter = NA, argument = NA, arg = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
    cat('arg:', arg, '\n')
}

The call to the function has not changed: f(arg = 1). But the output has:

parameter: NA
argument: NA
arg: 1

There is no warning from the runtime that you have abbreviated a parameter name. Also the runtime has no way of knowing that you wanted to have the other argument.

There is some small solace when two parameters share a common prefix but neither name is a prefix of the other. An example would be this:

f <- function (parameter = NA, argument = NA, argparse = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
    cat('argparse:', argparse, '\n')
}

When you run f(arg = 1) as before, you finally get an error:

Error in f(arg = 1) : argument 1 matches multiple formal arguments
Execution halted

In my code there could be such time bombs, and by default R gives neither an error nor a warning for them.
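For what it is worth, base R does have an opt-in setting that reports abbreviated argument names: the option warnPartialMatchArgs. The sketch below reuses the two-parameter f from above (reduced to return its argument) and captures the warning text; it is an illustration of the option, not a recommendation for any particular project setup.

```r
f <- function(parameter = NA, argument = NA) {
    argument
}

# Turn the warning on and remember the previous setting.
old <- options(warnPartialMatchArgs = TRUE)

# Capture the warning text instead of letting it print.
msg <- tryCatch(
    f(arg = 1),
    warning = function(w) conditionMessage(w)
)

options(old)  # restore the previous setting
print(msg)
```

Combined with options(warn = 2), which turns warnings into errors, this would make an abbreviated name fail loudly. It is off by default, so the time bombs stay silent unless you ask.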

To make things worse, this also plays badly with the ... syntax. Say you have a function which is just a wrapper of another function and want to allow the user to pass extra arguments. Say we want to wrap the function plot(x, y, col = 'black') such that we can directly plot a data frame with columns named x and y. We would write this as such:

plot_wrap <- function (df, ...) {
    plot(df$x, df$y, ...)
}

So far, this is not a problem. However, we could also have done it more generally in this way:

plot_wrap <- function (df, columns = c('x', 'y'), ...) {
    plot(df[[columns[1]]], df[[columns[2]]], ...)
}

This way our function is more general and supports other column names as well. Now suppose I have such a data frame and want to plot it in red. I write plot_wrap(df, col = 'red') in the hope that the color argument gets passed down to plot. But R will interpret the col = as a shorthand for columns = and then there will be really strange error messages.
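The mismatch can be made visible without plotting anything. The sketch below uses a hypothetical stand-in, show_matching, that has the same signature as the wrapper but only reports how its arguments were matched. The second variant relies on a rule of R's argument matching: parameters that come after ... are only matched by their full name.

```r
# Stand-in for the wrapper: same signature, but it only reports the matching.
show_matching <- function(df, columns = c('x', 'y'), ...) {
    list(columns = columns, extra = list(...))
}

df <- data.frame(x = 1:3, y = 4:6)

# col = is silently taken as an abbreviation of columns =.
bad <- show_matching(df, col = 'red')
print(bad$columns)        # the colour ended up here
print(length(bad$extra))  # nothing is left over for plot

# Parameters placed after ... must be spelled out in full.
show_matching_strict <- function(df, ..., columns = c('x', 'y')) {
    list(columns = columns, extra = list(...))
}

good <- show_matching_strict(df, col = 'red')
print(good$columns)    # still the default column names
print(good$extra$col)  # the colour is passed through
```

Moving the wrapper's own parameters behind ... is one possible way to avoid the clash, at the price of forcing callers to name them.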

Assignment operator

R has five assignment operators:

  • <-
  • <<-
  • ->
  • ->>
  • =

The first one is a normal assignment, x <- 3. The second one does not create a local variable at all: it looks for an existing variable in the enclosing scopes and assigns to that, and if it finds none, it creates the variable in the global environment. The third and fourth are just mirrored versions, such that you can write 3 -> x. At some point in the past, this must have seemed like a really great idea.
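A small sketch of where <<- is typically seen: a closure that updates a variable in its enclosing function. The name make_counter is made up for this example.

```r
make_counter <- function() {
    count <- 0                 # local to make_counter
    function() {
        count <<- count + 1    # updates count in the enclosing scope
        count
    }
}

counter <- make_counter()
first <- counter()
second <- counter()
print(c(first, second))

# The mirrored form assigns to the name on the right-hand side.
3 -> y
print(y)
```

With a plain <- inside the inner function, count would be a fresh local variable on every call and the counter would never advance.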

Since this is confusing for people coming from other programming languages where = is the assignment operator (C, C++, Python, Haskell, PHP, Java, JavaScript), modern versions of R also have this = assignment operator. Now there are ideological wars on whether <- or = is the correct one to use.

The thing is that <- is a token consisting of two characters, each of which is a valid token by itself! So we can write both x <- 5 and x < -5. The first assigns 5 to x (x := 5), the second does the comparison x < -5. This in itself is not such a big deal; both cases are easy for a human to read.

But what happens if you have code by a person who does not put spaces around operators? Their code will feature x<-5. What does it do? It will be parsed as x <- 5. But can you be certain that the author did not mean x < -5? In C++ you can just write x<-5 and it will mean x < -5 because there is no <- operator.

In practice this might not be an actual problem. The values of the expressions are different. But perhaps the code with if (x<-5) does the apparently right thing today but not tomorrow.
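The difference is easy to check in a session. The sketch below shows the three spellings side by side and what happens when the unspaced form ends up inside an if condition.

```r
x <- 10
x<-5                      # parsed as the assignment x <- 5
after_assign <- x

x <- 10
compared <- x < -5        # a comparison; x is left alone
also_compared <- x<(-5)   # parentheses force the comparison without spaces

# Inside if(), the assignment form yields the assigned value 5, which
# counts as TRUE, so the branch always runs and x is overwritten.
x <- 10
branch_taken <- if (x<-5) TRUE else FALSE

print(after_assign)
print(compared)
print(also_compared)
print(branch_taken)
print(x)
```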

Puns everywhere

R is riddled with puns. A lot of package names are some word with an “R” put into it. There is “knitr”, “tidyr”, “stringr”. The “knitr” package can generate beautiful reports from R Markdown documents. These are documents with headings and text interleaved with R code. This is my preferred format for things like experimental reports where you want to document something and show off your data.

The functions in this package are called knit and purl. I knew the first word, the activity of converting wool into garments. But “purl” I had to look up; it seems to be a similar stitch worked in the opposite direction. One could argue that rmd_to_md and rmd_to_r would be better descriptions, but they would also be bourgeois and not funny. The problem with the punny names is that you cannot search for them, nor can you remember or guess them easily.

But I won’t complain about the “tidyverse” packages because they have fundamentally changed the way I work with data and “ggplot2” is hands down the best plotting system I have ever tried.

Missing arguments

In languages like Python and C++ there is a concept of optional arguments. In R, this also exists, but with a slight twist.

Define a function like this in Python:

def py_func(a, b):
    print(a)

The argument b is not used within the function, but bear with me.

If you try to call this as py_func(1), it will fail loud and clear:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: py_func() missing 1 required positional argument: 'b'

What you could do is to define the function with a default value for b, like this:

def py_func(a, b=2):
    print(a)

Now calling it as py_func(1) works just fine, because this gets called as py_func(1, 2).

In C++ it works exactly the same way: you need to specify all the arguments that have no default value.

In R, this is not the case. When you declare a function analogously to the first example, it would look like this:

r_func <- function(a, b) {
    print(a)
}

Now calling r_func(1) will work! It will print out the 1 and does not complain. Only when we call r_func() without any arguments, it will complain that the parameter a was not passed a value and that there is no default value.
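The reason is that R evaluates arguments lazily: a missing argument only becomes a problem at the moment the function body actually uses it. A minimal sketch, where uses_b is a made-up counterpart to r_func that does touch b:

```r
r_func <- function(a, b) {
    a          # b is never evaluated, so its absence goes unnoticed
}

uses_b <- function(a, b) {
    a + b      # evaluating b is what triggers the error
}

ok <- r_func(1)

# Capture the error message instead of aborting the script.
err <- tryCatch(
    uses_b(1),
    error = function(e) conditionMessage(e)
)

print(ok)
print(err)
```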

This means that when you write a function, you cannot be certain that the caller passed something for each parameter. It only crashes once you do some computation with a parameter that was not passed. Sometimes you even want to make a parameter optional without giving it a default value. In languages like Python or C++ you would have to use some neutral sentinel value like None (Python) or nullptr (C++). However, there still is a difference between a parameter not being passed at all and it explicitly being passed the value NULL (R).

In R there is the function missing which will check whether the parameter was missing in the call. Therefore you see in R code things like this:

r_func <- function(a, b) {
    if (missing(a)) {
        stop("Parameter `a` is missing and needs to be passed!")
    }
    if (missing(b)) {
        stop("Parameter `b` is missing and needs to be passed!")
    }
}

There is no way to know from the function signature which parameters are mandatory and which are optional. You need to look into the documentation or even the function definition to figure that out.
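To illustrate the difference between "not passed" and "passed as NULL", here is a small sketch with a hypothetical function describe_b that distinguishes the three cases using missing and is.null.

```r
describe_b <- function(a, b) {
    if (missing(b)) {
        "b was not passed"
    } else if (is.null(b)) {
        "b was passed as NULL"
    } else {
        "b has a value"
    }
}

not_passed <- describe_b(1)
passed_null <- describe_b(1, NULL)
passed_value <- describe_b(1, 2)

print(not_passed)
print(passed_null)
print(passed_value)
```

Nothing in the signature function(a, b) hints that b is treated as optional here; only the body does.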

Data frame access operator

The data.frame class has two access operators: [ and [[. The first selects a subset of a data frame, the second a single element only. The row and column arguments to the first variant can be vectors with length greater than one.

You can specify zero, one, or two arguments. R also allows arguments to be missing explicitly or implicitly. Therefore we can write stuff like df[, ], which has two implicitly missing arguments. The method [.data.frame does all sorts of different things depending on which arguments it gets.

Let’s agree on the following data frame:

df <- data.frame(a = 10:14, b = 15:19, c = 20:24)

The variable df then contains this:

   a  b  c
1 10 15 20
2 11 16 21
3 12 17 22
4 13 18 23
5 14 19 24

However, we can also obtain this using df[] and df[, ]. These give us the whole data frame.

In order to select two columns from it, we can use df[c(1, 2)] and obtain a new data frame with just these columns:

   a  b
1 10 15
2 11 16
3 12 17
4 13 18
5 14 19

We can also write this as df[, c(1, 2)] to mean “all rows, columns 1 and 2”. It gives the same result.

In order to get two rows, we do df[c(3, 4), ] and obtain another data frame:

   a  b  c
3 12 17 22
4 13 18 23

But what happens when we only happen to select a single column this way? When we use df[1], we obtain a data frame:

   a
1 10
2 11
3 12
4 13
5 14

But the seemingly equivalent df[, 1] gives us a vector:

[1] 10 11 12 13 14

You can also get the vector using df$a with the named column.

We can force it to give us a data frame with another argument, df[, 1, drop = FALSE]. Then we get the same result as df[1].
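The return types can be checked directly. The sketch below rebuilds the same data frame and inspects each variant; it also includes the [[ operator, which always returns the column or element itself.

```r
df <- data.frame(a = 10:14, b = 15:19, c = 20:24)

one_col_df <- df[1]                     # one index: stays a data frame
one_col_vec <- df[, 1]                  # two indices, one column: drops to a vector
kept_as_df <- df[, 1, drop = FALSE]     # drop = FALSE keeps the data frame
by_double <- df[[1]]                    # [[ returns the column itself
single_value <- df[[2, "b"]]            # one element: row 2 of column b

print(class(one_col_df))
print(class(one_col_vec))
print(class(kept_as_df))
print(by_double)
print(single_value)
```

In code that has to work for any number of selected columns, writing drop = FALSE explicitly is one way to keep the result type predictable.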

Quiz

Here is a little quiz: What are the return types of the following expressions? The answer follows each expression.

  • df: data frame
  • df[]: data frame
  • df[, ]: data frame
  • df[1, ]: data frame
  • df[1]: data frame
  • df[, 1]: vector
  • df[, 1, drop = FALSE]: data frame
  • df[c(2, 3), c(1, 2)]: data frame

Logical AND

Like in C and C++, there are the & and && operators. In R, they are similar to their C counterparts but not the same. The & does a normal element-wise AND operation. The && is meant for single values: for a long time it silently used only the first element of each operand, so c(T, F) && c(T, T) came out as true. I do not understand why it ever made sense to define it that way, but that seems to be just another quirk in the list. Why would one have boolean vectors and only work with the first elements? Since R 4.3.0, operands with more than one element are an error instead.
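A short sketch of how the two operators can be used without depending on what && does with longer vectors: & for element-wise work, all or any to collapse the result, and && only with single values.

```r
a <- c(TRUE, FALSE)
b <- c(TRUE, TRUE)

elementwise <- a & b        # compares position by position
all_true <- all(a & b)      # collapse to one value: are all pairs TRUE?
any_true <- any(a & b)      # collapse to one value: is any pair TRUE?
scalar <- a[1] && b[1]      # explicit about using the first elements only

# && short-circuits: the right-hand side is skipped when the left is FALSE.
short_circuit <- FALSE && stop("never evaluated")

print(elementwise)
print(all_true)
print(any_true)
print(scalar)
print(short_circuit)
```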