Long and Wide Data Format
There are multiple ways of structuring data. Two sensible formats are the long and the wide data format. In this article I want to solve the same problem in both formats and show their relative merits.
As a task, we want to compute $\pi$ using Monte Carlo techniques. We want to see how the precision and accuracy scale with the number of iterations. For the precision we need to compute an error estimate. This will be done with the bootstrap procedure, though here we can actually sample from the underlying distribution.
For each number of samples $N$ we want to draw $2 N$ numbers from a uniform distribution in the interval $[0, 1]$. Then we use these as $x$ and $y$ values and compute $r^2 = x^2 + y^2$. We count how many points have $r \le 1$ as the accepted points $A$. The ratio $A/N$ will approximate $\pi/4$, because that is the area of a quarter circle with $r = 1$.
In order to estimate the error, we follow the above prescription $R$ times and then take the standard deviation of the results. We want to produce a plot which shows the absolute deviation $|A/N - \pi/4|$ with error bars versus the number of samples $N$.
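To make the prescription concrete, here is a minimal sketch of a single estimate plus its error, using a hypothetical helper `quarter_pi_estimate` and example values $N = 1000$ and $R = 150$. This is only an illustration; the two implementations below are what is actually timed.

```r
# Hypothetical helper: one estimate of pi/4 from n uniform (x, y) pairs.
quarter_pi_estimate <- function(n) {
  x <- runif(n)
  y <- runif(n)
  sum(x^2 + y^2 < 1) / n  # A/N
}

# Repeat R times and take the standard deviation as the error estimate.
estimates <- replicate(150, quarter_pi_estimate(1000))
c(value = mean(estimates),
  error = sd(estimates),
  accuracy = abs(mean(estimates) - pi/4))
```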
We want $N$ to be the following values:
num_samples <- c(500, 1000, 2000, 4000)
$R$ is kept fixed throughout:
num_bootstrap <- 150
Wide format
We first need to generate the data for each $N$.
start_wide <- proc.time()

# Draw 2 * n * R uniform numbers and arrange them as an array whose
# dimensions are: x-y pair (2), sample index (n), bootstrap replicate (R).
make_numbers <- function(n) {
  numbers <- runif(2 * n * num_bootstrap)
  array(numbers, dim = c(2, n, num_bootstrap))
}

# The arrays have different sizes, so sapply() cannot simplify and returns a list.
samples <- sapply(num_samples, make_numbers)
`samples` is now a list which contains one array per value of $N$. Their first dimension is the $x$-$y$ dimension and has two elements, the second dimension indexes the actual samples, and the last dimension is the bootstrap dimension. We now want to square all the numbers and then sum over the first dimension to get $x^2 + y^2$. After that, only two dimensions are left.

From there on, we count how many of the numbers are below 1 in each column to get the $R$ bootstrap estimates for our data.
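Since `apply` margins can be confusing, here is a toy example with made-up dimensions (2 × 3 × 2) which shows that keeping margins 2 and 3 sums over the first, i.e. the $x$-$y$, dimension:

```r
# Toy array: dimension 1 is the x-y pair, then 3 "samples", 2 "replicates".
a <- array(1:12, dim = c(2, 3, 2))
apply(a, 2:3, sum)  # 3 x 2 matrix: each entry is x + y for one sample/replicate
```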
values <- numeric(length(samples))
errors <- numeric(length(samples))

for (i in seq_along(samples)) {
  s <- samples[[i]]
  # Square every element, then sum over the x-y dimension (dimension 1),
  # leaving an n x R matrix of x^2 + y^2 values.
  squared <- apply(s, 1:3, function(x) x^2)
  summed <- apply(squared, 2:3, sum)
  # Apply the radius criterion and compute A/N for each bootstrap replicate.
  less_eq_one <- apply(summed, 1:2, function(x) x < 1)
  ratio <- apply(less_eq_one, 2, function(x) sum(x) / length(x))
  values[i] <- mean(ratio)
  errors[i] <- sd(ratio)
}
From here on we can just subtract the true value $\pi/4$ and take the absolute value. The time measurement stops here.
accuracy <- abs(values - pi/4.0)
end_wide <- proc.time()
Now we can go ahead and plot this.
plot(x = num_samples, y = accuracy)
With the `hadron` library we can also plot this with error bars:
library(hadron)
## Loading required package: boot
plotwitherror(x = num_samples, y = accuracy, dy = errors)
The result itself is rather unspectacular, but that was not really the point to begin with.
Long format
Next we try the same with the long format. For this there are a bunch of packages which are really helpful.
library(knitr)
library(ggplot2)
library(magrittr)
library(dplyr)
We start with generating the samples. First we need to build a data frame which has one row per $x$-$y$ combination that we want to take measurements with. The number of samples varies, so we cannot use `expand.grid` right away but must build the data frame from multiple `expand.grid` calls and then glue them together with `rbind`.
start_long <- proc.time()

f <- function(n) {
  expand.grid(N = n, r = rep(1:num_bootstrap, each = n))
}

long <- do.call(rbind, mapply(f, num_samples, SIMPLIFY = FALSE))
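To see what a single building block looks like, here is a toy call with made-up small numbers ($n = 2$, three bootstrap replicates) instead of the real sizes:

```r
expand.grid(N = 2, r = rep(1:3, each = 2))
##   N r
## 1 2 1
## 2 2 1
## 3 2 2
## 4 2 2
## 5 2 3
## 6 2 3
```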
Sampling for $x$ and $y$ is easy now; we just need to add two more columns with the right number of random numbers to the data frame.
long$x <- runif(nrow(long))
long$y <- runif(nrow(long))
This is what the data structure looks like:
kable(head(long))
N | r | x | y |
---|---|---|---|
500 | 1 | 0.3472060 | 0.8126291 |
500 | 1 | 0.4836393 | 0.0258222 |
500 | 1 | 0.4090112 | 0.4127594 |
500 | 1 | 0.9862454 | 0.1251802 |
500 | 1 | 0.3422988 | 0.4724890 |
500 | 1 | 0.0246883 | 0.7189740 |
In order to incorporate the radius criterion we can use the `%<>%` operator from `magrittr`.
long %<>% mutate(accept = x^2 + y^2 < 1)
Alternatively one could have written this as
long <- mutate(long, accept = x^2 + y^2 < 1)
The current state of the data frame:
kable(head(long))
N | r | x | y | accept |
---|---|---|---|---|
500 | 1 | 0.3472060 | 0.8126291 | TRUE |
500 | 1 | 0.4836393 | 0.0258222 | TRUE |
500 | 1 | 0.4090112 | 0.4127594 | TRUE |
500 | 1 | 0.9862454 | 0.1251802 | TRUE |
500 | 1 | 0.3422988 | 0.4724890 | TRUE |
500 | 1 | 0.0246883 | 0.7189740 | TRUE |
We now want to compute the ratio $A/N$ for each value of $N$ and each bootstrap sample, and then average over the bootstrap samples. With `dplyr` we first do a `group_by`, which is akin to holding the margins in the `apply` function. Then we do a `summarize` (a reduction of sorts) to compute $A/N$. From there on we group by $N$ again (to collapse the $R$ bootstrap samples) and compute the mean and standard deviation. As a last step we add the `accuracy` column.
long_res <- long %>%
  group_by(N, r) %>%
  summarize(accepted = sum(accept) / n()) %>%
  group_by(N) %>%
  summarize(value = mean(accepted), error = sd(accepted))

long_res %<>% mutate(accuracy = abs(value - pi/4.0))

end_long <- proc.time()
This is the resulting data frame:
kable(long_res)
N | value | error | accuracy |
---|---|---|---|
500 | 0.785920 | 0.0201350 | 0.0005218 |
1000 | 0.784820 | 0.0141536 | 0.0005782 |
2000 | 0.786420 | 0.0093601 | 0.0010218 |
4000 | 0.786255 | 0.0063300 | 0.0008568 |
This can nicely be plotted with `ggplot2`:
ggplot(long_res, aes(x = N, y = accuracy)) +
  geom_point() +
  #scale_x_log10() +
  #scale_y_log10() +
  geom_errorbar(aes(ymin = accuracy - error, ymax = accuracy + error)) +
  labs(title = 'Monte Carlo Error Scaling',
       x = 'Samples N',
       y = 'Absolute deviation') +
  theme_light()
Summary
I find the long format much easier to work with. In case we wanted to add another variable, we could just do that. With the wide format we would have to add another dimension to the array. Also, with the wide format we lost track of the metadata along the way: the list of arrays did not know which particular value of $N$ each element corresponded to, so we needed to look that up via the index into `num_samples`. The long format does not have this problem; all the data is right there.
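For instance, a hypothetical additional variable such as the radius of each point is just one more `mutate` call on the long data frame, with no restructuring needed:

```r
# Hypothetical extra column; the wide format would need another array
# dimension or a parallel data structure for this.
long %<>% mutate(radius = sqrt(x^2 + y^2))
```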
Contrary to my intuition the wide format version is much slower than the long format version:
end_wide - start_wide
##    user  system elapsed
##  12.255   0.081  12.488
end_long - start_long
##    user  system elapsed
##   1.628   0.016   1.661
I have also tried to exchange the order of the dimensions of the array in case the row-major or column-major layout was chosen incorrectly. However, that does not seem to have a significant effect.
The memory layout of the wide format should be just perfect for the work that I am doing here, but somehow the long format is faster, even though we have to call `group_by` twice. It would be interesting to find out why the performance is so much better.
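My suspicion (not verified as part of the timings above) is that the element-wise `apply` calls in the wide version dominate the runtime: they invoke an R function for every single element, whereas vectorized operators and `dplyr` work on whole vectors at once. A rough sketch of how one could check this, with a hypothetical array `s_demo`:

```r
# Hypothetical micro-benchmark, not part of the measured runs above.
s_demo <- array(runif(2 * 4000 * num_bootstrap), dim = c(2, 4000, num_bootstrap))
system.time(squared_apply <- apply(s_demo, 1:3, function(x) x^2))  # one R call per element
system.time(squared_vec   <- s_demo^2)                             # single vectorized call
all.equal(squared_apply, squared_vec)
```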
There is a price to pay in memory, though. The long format takes up a bunch of memory:
library(pryr)
object_size(long)
## 36 MB
The wide format only contains the data, so that should be smaller. However, we get similar sizes for this particular example.
object_size(samples)
## 18 MB
object_size(summed)
## 4.8 MB
object_size(squared)
## 9.6 MB
object_size(less_eq_one)
## 2.4 MB
object_size(ratio)
## 1.24 kB
The last variables only contain the data for the largest $N$, since they are overwritten in every iteration of the `for` loop; their combined size is a little less than that of `samples` itself. Taking everything together we end up in a similar range as the long format.
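A quick tally of the sizes reported above (rounded, in MB) makes the comparison explicit:

```r
# Wide-format pieces from the object_size() calls above, in MB:
18 + 9.6 + 4.8 + 2.4  # ~ 34.8 MB in total, versus 36 MB for `long`
```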
Conclusion
Although I initially thought of it as inefficient and a waste of memory, I am now quite happy with the long data format for most tasks. It makes working with the data and associated metadata much easier, and it seems to be quite efficient in R.