R Horror, aka "HorroR"
I have been using R for a while now, and there are a few things that seem outright horrifying to me. Don't get me wrong, R is a great tool; I just find the language a bit bewildering.
Dot and dollar
In almost every other language, the dot is a namespace separator. Say you have this in C++:

```cpp
Myclass instance(3);
instance.do_stuff();
```
Not so in R! There, the `.` is just what you use instead of the underscore. There are functions like `install.packages`, `update.packages` and `read.table`, which in other languages would be called `install_packages` and so on.
But the S3 object system hooks into these dots. So when you have an object of class `myclass` and call `plot(instance)`, that call will be forwarded to `plot.myclass(instance)`. So the dot also carries some sort of hierarchy. And what is the dot in other languages is the dollar in R. This is really confusing at first, but after a while one does not think about it too much any more.
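To make the dispatch concrete, here is a minimal sketch; the class name `myclass` and its fields are made up for illustration:

```r
# An object is just a list with a class attribute.
instance <- structure(list(x = 1:3, y = c(2, 4, 6)), class = "myclass")

# Because the object has class "myclass", the generic print() forwards
# to print.myclass() -- the dot here is S3 dispatch, not a namespace.
print.myclass <- function(x, ...) {
    cat("myclass with", length(x$x), "points\n")  # '$' accesses fields
}

print(instance)  # prints: myclass with 3 points
```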
Abbreviating named function parameters
When you have a function, you can call it with named parameters. Take the following function:
```r
f <- function(parameter = NA, argument = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
}
```
It has two named arguments, both set to `NA` by default. Let us call it with just the second one set to some value. We do not need to type out the whole name, we can just type `arg`:

```r
f(arg = 1)
```
The output that we get is the following:

```
parameter: NA
argument: 1
```

As you can see, the `1` has been passed to the parameter `argument` although we just typed `arg` in the function call.
Sounds great? Let me change your mind. Say the author of the function adds another argument whose name is a prefix of an existing one, like so:
```r
f <- function(parameter = NA, argument = NA, arg = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
    cat('arg:', arg, '\n')
}
```
The call to the function has not changed: `f(arg = 1)`. But the output has:

```
parameter: NA
argument: NA
arg: 1
```
There is no warning from the runtime that you have abbreviated a parameter name, and the runtime has no way of knowing that you meant the other argument.
There is a little solace: when two parameters share a common prefix but neither is a prefix of the other, the abbreviation becomes ambiguous. An example would be this:
```r
f <- function(parameter = NA, argument = NA, argparse = NA) {
    cat('parameter:', parameter, '\n')
    cat('argument:', argument, '\n')
    cat('argparse:', argparse, '\n')
}
```
When you run `f(arg = 1)` as before, you finally get an error:

```
Error in f(arg = 1) : argument 1 matches multiple formal arguments
Execution halted
```
My code could contain such time bombs, and by default the runtime gives no error for any of this.
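There is, at least, an option that upgrades partial matching to a warning (not an error, but better than silence):

```r
# Make R warn whenever an argument name is only partially matched.
options(warnPartialMatchArgs = TRUE)

f <- function(parameter = NA, argument = NA) {
    cat('argument:', argument, '\n')
}

f(arg = 1)  # now emits a warning about the partial match of 'arg'
```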
To make things worse, this also interacts badly with the `...` syntax. Say you have a function which is just a wrapper around another function and want to allow the user to pass extra arguments. Say we want to wrap the function `plot(x, y, col = 'black')` such that we can directly plot a data frame with columns named `x` and `y`. We would write this as such:
```r
plot_wrap <- function(df, ...) {
    plot(df$x, df$y, ...)
}
```
So far, this is not a problem. However, we could also have done it more generally in this way:
```r
plot_wrap <- function(df, columns = c('x', 'y'), ...) {
    plot(df[[columns[1]]], df[[columns[2]]], ...)
}
```
This way our function is more general and supports other column names as well.
Now suppose I have such a data frame and want to plot it in red. I write `plot_wrap(df, col = 'red')` in the hope that the color argument gets passed down to `plot`. But R will interpret `col =` as a shorthand for `columns =`, and then there will be really strange error messages.
Assignment operator
R has five assignment operators:

- `<-`
- `<<-`
- `->`
- `->>`
- `=`
The first one is a normal assignment, `x <- 3`. The second one searches the enclosing scopes for an existing variable and assigns there instead of creating a local shadowing variable (falling back to the global environment if nothing is found). The third and fourth ones are just mirrored versions, such that you can write `3 -> x`. At some point in the past, this must have seemed like a really great idea.
Since this is confusing for people coming from other programming languages where `=` is the assignment operator (C, C++, Python, Haskell, PHP, Java, JavaScript), modern versions of R also support `=` for assignment. Now there are ideological wars on whether `<-` or `=` is the correct one to use.
The thing is that `<-` is a token consisting of two characters, each of which is a valid token by itself! So we can write both `x <- 5` and also `x < -5`. The first assigns `5` to `x` ($x := 5$), the second does the comparison $x < -5$. This in itself is not such a big deal; both cases are easy to read for a human.
But what happens if you have code by a person who does not put spaces around operators? Their code will feature `x<-5`. What does it do? It will be parsed as `x <- 5`. But can you be certain that the author did not mean `x < -5`? In C++ you can just write `x<-5` and it will mean `x < -5`, because there is no `<-` operator.
In practice this might not be an actual problem; the values of the two expressions differ, after all. But perhaps the code with `if (x<-5)` does the apparently right thing today but not tomorrow.
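A small demonstration of how much damage the missing space can do, assuming the author actually meant the comparison `x < -5`:

```r
x <- 10

# Without spaces this is parsed as an assignment, not a comparison:
if (x<-5) {
    cat("branch taken\n")  # runs, because the condition is the value 5
}

x  # x is now 5 -- the intended "comparison" silently overwrote it
```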
Puns everywhere
R is riddled with puns. A lot of package names are some word with an "R" put into it: there is "knitr", "tidyr", "stringr". The "knitr" package can generate beautiful reports from R Markdown documents. These reports are documents with headings and text interleaved with R code. This is my preferred format for things like experimental reports where you want to document something and show off your data.
The methods in this package are called `knit` and `purl`. I knew the first word, the activity of converting wool into garments. But "purl" I had to look up; it seems to be something similar but in the opposite direction. One could argue that `rmd_to_md` and `rmd_to_r` would be better descriptions, but it would also be bourgeois and not funny. The problem with those names is that you cannot search for them, nor can you remember or guess them easily.
But I won't complain about the "tidyverse" packages because they have fundamentally changed the way I work with data and "ggplot2" is hands down the best plotting system I have ever tried.
Missing arguments
In languages like Python and C++ there is a concept of optional arguments. In R, this also exists, but with a slight twist.
Define a function like this in Python:

```python
def py_func(a, b):
    print(a)
```
The argument `b` is not used within the function, but bear with me. If you try to call this as `py_func(1)`, it will fail loud and clear:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: py_func() missing 1 required positional argument: 'b'
```
What you could do is to define the function with a default value for `b`, like this:

```python
def py_func(a, b=2):
    print(a)
```

Now calling it as `py_func(1)` works just fine, because this gets called as `py_func(1, 2)`.
In C++ it is exactly the same way: you need to specify all the arguments that have no default value.
In R, this is not the case. When you declare a function analogously to the first example, it would look like this:
```r
r_func <- function(a, b) {
    print(a)
}
```
Now calling `r_func(1)` will work! It will print out the `1` and not complain. Only when we call `r_func()` without any arguments will it complain that the parameter `a` was not passed a value and that there is no default value. The reason is that R evaluates function arguments lazily: an argument is only required at the moment it is actually used.
This means that when you write a function, you cannot be certain that the caller passed something for each parameter. Only when you do some computation with one of the parameters will it crash. Sometimes you even want to make a parameter optional without giving it a default value. In languages like Python or C++ you would have to use some neutral sentinel value like `None` (Python) or `nullptr` (C++). In R, however, there still is a difference between a parameter not being passed at all and it explicitly being passed the value `NULL`.
In R there is the function `missing`, which checks whether the parameter was missing in the call. Therefore you see things like this in R code:

```r
r_func <- function(a, b) {
    if (missing(a)) {
        stop("Parameter `a` is missing and needs to be passed!")
    }
    if (missing(b)) {
        stop("Parameter `b` is missing and needs to be passed!")
    }
}
```
There is no way to know from the function signature which parameters are mandatory and which are optional. You need to look into the documentation or even the function definition to figure that out.
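The interplay of `missing` and `NULL` can be sketched like this; `describe_b` is a made-up helper that tells the three cases apart:

```r
# Sketch: telling apart "not passed at all" and "explicitly NULL".
describe_b <- function(a, b) {
    if (missing(b)) {
        "b was not passed at all"
    } else if (is.null(b)) {
        "b was explicitly NULL"
    } else {
        "b has a value"
    }
}

describe_b(1)        # "b was not passed at all"
describe_b(1, NULL)  # "b was explicitly NULL"
describe_b(1, 3)     # "b has a value"
```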
Data frame access operator
The `data.frame` class has two access operators: `[` and `[[`. The first selects a subset of a data frame, the second a single element only. The row and column arguments to the first variant can be vectors with length greater than one.
You can specify zero, one, or two arguments, and R also allows arguments to be missing explicitly or implicitly. Therefore we can write stuff like `df[, ]`, which has two implicitly missing arguments. The method `[.data.frame` does all sorts of different things depending on which arguments are given.
Let's agree on the following data frame:

```r
df <- data.frame(a = 10:14, b = 15:19, c = 20:24)
```
The variable `df` then contains this:

```
   a  b  c
1 10 15 20
2 11 16 21
3 12 17 22
4 13 18 23
5 14 19 24
```
However, we can also obtain this using `df[]` and `df[, ]`. These give us the whole data frame.
In order to select two columns from it, we can use `df[c(1, 2)]` and obtain a new data frame with just these columns:

```
   a  b
1 10 15
2 11 16
3 12 17
4 13 18
5 14 19
```
We can also write this as `df[, c(1, 2)]` to mean "all rows, columns 1 and 2". It gives the same result.
In order to get two rows, we do `df[c(3, 4), ]` and obtain another data frame:

```
   a  b  c
3 12 17 22
4 13 18 23
```
But what happens when we only happen to select a single column this way? When we use `df[1]`, we obtain a data frame:

```
   a
1 10
2 11
3 12
4 13
5 14
```
But the seemingly equivalent `df[, 1]` gives us a vector:

```
[1] 10 11 12 13 14
```
You can also get this vector using `df$a` with the named column. We can force `[` to give us a data frame with another argument, `df[, 1, drop = FALSE]`. Then we get the same result as with `df[1]`.
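The `[[` operator mentioned at the start is at least predictable: it always extracts a single column as a vector. A quick sketch with the same `df` as above:

```r
df <- data.frame(a = 10:14, b = 15:19, c = 20:24)

df[[1]]      # the vector 10 11 12 13 14, same as df$a
df[["a"]]    # also the vector, selected by name

# Both '[[' and the drop-happy 'df[, 1]' land on the same vector:
identical(df[[1]], df[, 1])  # TRUE
```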
Quiz
Here is a little quiz: What are the return types of the following expressions?

- What is the type of `df`?
- What is the type of `df[]`?
- What is the type of `df[, ]`?
- What is the type of `df[1, ]`?
- What is the type of `df[1]`?
- What is the type of `df[, 1, drop = FALSE]`?
- What is the type of `df[c(2, 3), c(1, 2)]`?

It is `data.frame` for every one. But for the following, it is a vector!

- What is the type of `df[, 1]`?
Logical AND
Like in C and C++, there are the `&` and `&&` operators. In R they are somewhat similar, but different. The `&` does a normal element-wise AND operation. The `&&` only takes the first element of each operand. I do not understand why it makes sense to define the operation `c(T, F) && c(T, T)` to be true, but that seems to be just another quirk for the list. Why would one have boolean vectors and only work with the first elements? (Newer versions of R have at least turned `&&` on vectors longer than one into an error.)
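For comparison, here is the well-behaved part of the pair; `&&` itself is only shown on scalars, since applying it to longer vectors errors on recent R versions:

```r
# Element-wise AND keeps the whole vector:
c(TRUE, FALSE) & c(TRUE, TRUE)
#> [1]  TRUE FALSE

# '&&' is meant for single conditions, e.g. in if():
TRUE && FALSE
#> [1] FALSE
```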
Index vector lengths
When you have an array with the numbers from 1 to 10 (inclusive), it has length 10. When you try to index that with a boolean index vector of length 3 in Python NumPy, it will fail:
```
>>> import numpy as np
>>> np.arange(1, 11)[np.arange(1, 4) < 4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: boolean index did not match indexed array along dimension 0; dimension is 10 but corresponding boolean dimension is 3
```
When you do the same thing in R, you get no warning whatsoever:
```
> (1:10)[c(TRUE, TRUE, FALSE)]
[1]  1  2  4  5  7  8 10
```
A logical index vector of insufficient length is simply recycled to the length of the indexed vector. You can also specify an index vector which is longer than the vector itself; then you get a bunch of `NA` values:
```
> (1:10)[1:30 > 3]
 [1]  4  5  6  7  8  9 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[26] NA NA
```
So when you mistakenly use an index vector meant for the rows on the columns, you do not get any error that would help you find that bug.
Unusual function names and argument orders
In most languages that are somewhat functional there is a `map` function, which takes a function and a container and maps that to a new container. Mathematically we could write

$$\text{map} \colon (X \to Y) \times [X] \to [Y] \,.$$

In Haskell the signature of the function indeed is this:

```haskell
map :: (a -> b) -> [a] -> [b]
```
In the Wolfram Language, `Map` also works the same way: function as the first argument, container as the second argument. In Python, too, `map` takes the function and then the iterable. Okay, C++ has `std::transform`, which takes a pair of input iterators, an output iterator and *then* the transformation. But that is so weird to use that it does not really count in this comparison.
And then you have R, which does not call it "map" but `lapply`. And to make it worse, the argument order is different: it takes the list *first* and the function *second*. This is so irritating when I work on my project consisting of both Wolfram Language and R code that I routinely stumble over this. The `mapply` function takes the function as its first argument, and the `apply` function takes the function as its last argument. You can't make this stuff up, except maybe if you are used to PHP.
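The inconsistency is easy to demonstrate side by side; all three of these square the numbers 1, 2, 3:

```r
lapply(1:3, function(x) x^2)   # data first, function second
mapply(function(x) x^2, 1:3)   # function first, data second
Map(function(x) x^2, 1:3)      # function first as well
```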
Inclusive range
In Python and C++ the ranges used to access sequences are inclusive at the beginning and exclusive at the end. Also, indices start from zero, but that is not so important here. If you take `seq[4:6]` in Python, you get the elements with indices 4 and 5, but not 6. In R you would also get the element with index 6. Python's `range` function will also not give you the last element, so this matches nicely.
I needed to traverse the upper triangle of a matrix. So I did this in R:
```r
for (row in 1:size) {
    for (col in (row + 1):size) {
        i1 <- (row - 1) * size + col
        i2 <- (col - 1) * size + row
        # …
    }
}
```
The `size` is the number of rows and columns in the square matrix. If `size` is 3, then `1:size` is `1:3`, which is the sequence 1, 2, 3. But when `row` has the last value, 3, this gives a problem in the next line. We have `(row + 1):size`, which evaluates to `4:3`. In Python, `range(4, 3)` would just give an empty sequence. But in R it gives 4, 3! That means that we access elements that should not be there in the first place.
We therefore need to fix the outer loop such that it stops one row earlier.
```r
for (row in 1:(size - 1)) {
    for (col in (row + 1):size) {
        i1 <- (row - 1) * size + col
        i2 <- (col - 1) * size + row
        # …
    }
}
```
This fixes that issue, but we now have a problem if `size` is just 1. Then `row` will iterate over the sequence 1, 0, which produces incorrect indices as well. Therefore one needs yet another fix:
```r
if (size > 1) {
    for (row in 1:(size - 1)) {
        for (col in (row + 1):size) {
            i1 <- (row - 1) * size + col
            i2 <- (col - 1) * size + row
            # …
        }
    }
}
```
So although the inclusive indexing in R is supposed to make it easier for people to think about ranges, it leads to more cumbersome code than it would in C++ or Python.
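For what it's worth, both fixes can be avoided at once with `seq_len`, which yields an empty sequence for 0 instead of counting downwards; this is a sketch of the same loop:

```r
size <- 3

# seq_len(0) is empty, so the outer loop simply does not run for size == 1,
# and row never exceeds size - 1, so (row + 1):size never counts down:
for (row in seq_len(size - 1)) {
    for (col in seq(row + 1, size)) {
        i1 <- (row - 1) * size + col
        i2 <- (col - 1) * size + row
        # …
    }
}
```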