Long and Wide Data Format

There are multiple ways of structuring data. Two sensible formats are the long and the wide data format. In this article I want to solve the same problem in both formats and show their relative merits.

As a task, we want to compute $\pi$ using Monte Carlo techniques. We want to see how precision and accuracy scale with the number of samples. For the precision we need to compute an error estimate. This will be done with the bootstrap procedure, though here we can draw fresh samples from the underlying distribution instead of resampling.

For each number of samples $N$ we want to draw $2 N$ numbers from a uniform distribution in the interval $[0, 1]$. Then we use these as $x$ and $y$ values and compute $r^2 = x^2 + y^2$. We count how many points have $r \le 1$ as the accepted points $A$. The ratio $A/N$ will approximate $\pi/4$, because that is the area of a quarter circle with $r = 1$.
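
As a minimal sketch of the idea (with a hypothetical $N = 1000$, not one of the values used below):

n <- 1000
x <- runif(n)
y <- runif(n)
accept <- x^2 + y^2 <= 1   # points inside the quarter circle
mean(accept)               # approximates pi / 4
4 * mean(accept)           # approximates pi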

In order to estimate the error, we repeat the above prescription $R$ times and then take the standard deviation of the results as the error estimate. We want to produce a plot of the absolute deviation $|A/N - \pi/4|$ with error bars versus the number of samples $N$.

We want $N$ to be the following values:

num_samples <- c(500, 1000, 2000, 4000)

$R$ shall always be the same fixed number:

num_bootstrap <- 150

Wide format

We first need to generate the data for each $N$.

start_wide <- proc.time()

make_numbers <- function(n) {
    # Draw 2 * n * R uniform numbers and arrange them in a 2 x n x R array:
    # first axis x-y, second axis sample, third axis bootstrap replicate.
    numbers <- runif(2 * n * num_bootstrap)
    array(numbers, dim = c(2, n, num_bootstrap))
}

samples <- sapply(num_samples, make_numbers)

samples is now a list which contains one array per value of $N$. The first dimension is the $x$-$y$ dimension and contains two elements, the second dimension runs over the actual samples, and the last dimension is the bootstrap dimension. We now want to square all the numbers and then sum along the first axis to get $x^2 + y^2$. After that we only have two dimensions left, samples and bootstrap replicates.

From there on, we count how many of the numbers are below 1 in each column (one column per bootstrap replicate) to get the $R$ bootstrap estimates of $A/N$.

values <- numeric(length(samples))
errors <- numeric(length(samples))

for (i in 1:length(samples)) {
    s <- samples[[i]]

    squared <- apply(s, 1:3, function(x) x^2)                      # square every element
    summed <- apply(squared, 2:3, sum)                             # sum over the x-y axis: x^2 + y^2
    less_eq_one <- apply(summed, 1:2, function(x) x < 1)           # acceptance flag for each point
    ratio <- apply(less_eq_one, 2, function(x) sum(x) / length(x)) # A/N for each bootstrap replicate

    values[[i]] <- mean(ratio)
    errors[[i]] <- sd(ratio)
}
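
To double-check the axis bookkeeping, we can look at the shapes of the objects left over from the last loop iteration (a quick sketch; the numbers follow from num_samples and num_bootstrap as defined above):

dim(samples[[length(samples)]])  # 2 4000 150: x-y, sample, bootstrap
dim(summed)                      # 4000 150: x^2 + y^2 per sample and bootstrap
dim(less_eq_one)                 # 4000 150: acceptance flags
length(ratio)                    # 150: one A/N estimate per bootstrap replicate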

From here on out we can just subtract the actual value and take the absolute value. The time measurement stops here.

accuracy <- abs(values - pi/4.0)

end_wide <- proc.time()

Now we can go ahead and plot this.

plot(x = num_samples, y = accuracy)

(Plot: absolute deviation versus the number of samples $N$.)

With the hadron library we can also plot this with error bars:

library(hadron)
## Loading required package: boot
plotwitherror(x = num_samples, y = accuracy, dy = errors)

(Plot: absolute deviation versus $N$, now with error bars.)

The result itself is rather unspectacular, but that was not really the point to begin with.

Long format

Next we try the long format. For this, there are a bunch of packages which are really helpful.

library(knitr)
library(ggplot2)
library(magrittr)
library(dplyr)

We start with generating the samples. First we need to build a data frame which has one row per $x$-$y$ combination that we want to measure. The number of samples varies, so we cannot use expand.grid right away but must build the data frame from multiple expand.grid calls and then glue them together with rbind.

start_long <- proc.time()

f <- function(n) {
    # One block per N: n samples for each of the R bootstrap replicates.
    expand.grid(N = n, r = rep(1:num_bootstrap, each = n))
}

long <- do.call(rbind, mapply(f, num_samples, SIMPLIFY = FALSE))
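
As a quick sanity check (a sketch; the count follows directly from the values defined above), there should be one row per $x$-$y$ pair:

nrow(long)                        # 1125000
sum(num_samples) * num_bootstrap  # 1125000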

Sampling for $x$ and $y$ is easy now: we just need to add two more columns with the right number of random numbers to the data frame.

long$x <- runif(nrow(long))
long$y <- runif(nrow(long))

This is what the data structure looks like:

kable(head(long))
    N   r           x           y
  500   1   0.3472060   0.8126291
  500   1   0.4836393   0.0258222
  500   1   0.4090112   0.4127594
  500   1   0.9862454   0.1251802
  500   1   0.3422988   0.4724890
  500   1   0.0246883   0.7189740

In order to incorporate the radius criterion we can use the %<>% operator from magrittr.

long %<>% mutate(accept = x^2 + y^2 < 1)

Alternatively one could have written this as

long <- mutate(long, accept = x^2 + y^2 < 1)

The current state of the data frame:

kable(head(long))
    N   r           x           y   accept
  500   1   0.3472060   0.8126291     TRUE
  500   1   0.4836393   0.0258222     TRUE
  500   1   0.4090112   0.4127594     TRUE
  500   1   0.9862454   0.1251802     TRUE
  500   1   0.3422988   0.4724890     TRUE
  500   1   0.0246883   0.7189740     TRUE

We now want to compute the ratio $A/N$ for each value of $N$ and each bootstrap sample, and then average over the bootstrap samples. With dplyr we first do a group_by, which is similar to specifying the margins in the apply function. Then we do summarize, which reduces each group to a single row, to compute $A/N$. From there on we group by $N$ alone (to reduce over the $R$ bootstrap samples) and compute mean and standard deviation.

As a last step we add the accuracy column.

long_res <- long %>%
    group_by(N, r) %>%
    summarize(accepted = sum(accept) / n()) %>%
    group_by(N) %>%
    summarize(value = mean(accepted), error = sd(accepted))

long_res %<>% mutate(accuracy = abs(value - pi/4.0))

end_long <- proc.time()

This is the resulting data frame:

kable(long_res)
    N      value       error    accuracy
  500   0.785920   0.0201350   0.0005218
 1000   0.784820   0.0141536   0.0005782
 2000   0.786420   0.0093601   0.0010218
 4000   0.786255   0.0063300   0.0008568
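
If one wants to look at the per-bootstrap ratios before they are averaged, the pipeline can simply be cut after the first summarize. A hypothetical sketch:

long %>%
    group_by(N, r) %>%
    summarize(accepted = sum(accept) / n()) %>%
    head() %>%
    kable()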

This can nicely be plotted with ggplot2:

ggplot(long_res, aes(x = N, y = accuracy)) +
    geom_point() +
    #scale_x_log10() +
    #scale_y_log10() +
    geom_errorbar(aes(ymin = accuracy - error, ymax = accuracy + error)) +
    labs(title = 'Monte Carlo Error Scaling',
         x = 'Samples N',
         y = 'Absolute deviation') +
    theme_light()

(Plot: "Monte Carlo Error Scaling", absolute deviation versus samples $N$ with error bars.)

Summary

I find the long format much easier to work with. If we wanted to add another variable, we could simply add another column; with the wide format, we would have to add another dimension to the array. Also, with the wide format we lost track of the metadata along the way: the list of arrays did not know which particular value of $N$ each element corresponded to, we needed to look that up via the index in num_samples. The long format does not have this problem, all the data is right there.
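
As a hypothetical example, tagging every row with the sampling scheme used would just be one more mutate call:

# Hypothetical extra column; not assigned back, so the sizes reported below stay unchanged.
long %>% mutate(scheme = 'uniform') %>% head()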

Contrary to my intuition the wide format version is much slower than the long format version:

end_wide - start_wide
##    user  system elapsed 
##  12.255   0.081  12.488
end_long - start_long
##    user  system elapsed 
##   1.628   0.016   1.661

I have also tried to exchange the order of the dimensions of the array in case the row-major or column-major layout was chosen incorrectly. However that does not seem to have a significant effect.

The memory layout of the wide format should be just perfect for the work that I am doing here, but somehow the long format is faster, even though we have to call group_by twice. It would be interesting to find out why the performance is so much better.
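
One possible starting point, sketched here rather than something I did above, would be R's built-in sampling profiler applied to the most expensive part of the wide-format code:

# Sketch: profile the element-wise apply calls for the largest N.
Rprof("wide_profile.out")
s_big <- samples[[length(samples)]]
sq_big <- apply(s_big, 1:3, function(x) x^2)
sm_big <- apply(sq_big, 2:3, sum)
Rprof(NULL)
head(summaryRprof("wide_profile.out")$by.self)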

There is a price to pay in memory, though. The long format takes up a bunch of memory:

library(pryr)
object_size(long)
## 36 MB

The wide format only stores the raw numbers without the $N$ and $r$ columns, so it should be smaller. However, we end up with similar total sizes for this particular example.

object_size(samples)
## 18 MB
object_size(summed)
## 4.8 MB
object_size(squared)
## 9.6 MB
object_size(less_eq_one)
## 2.4 MB
object_size(ratio)
## 1.24 kB

The last variables only contain the data for the largest $N$ since they are overwritten in each pass of the for loop. The sum of all the wide-format objects is a little less than the size of the long data frame, so taking everything together we end up in a similar range.
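
A back-of-the-envelope calculation (a sketch, assuming 8 bytes per double and 4 bytes per integer) shows why the two end up close:

rows <- sum(num_samples) * num_bootstrap  # 1125000 x-y pairs in total
rows * 2 * 8 / 1e6                        # raw x and y doubles: 18 MB, matching the samples list
rows * (8 + 8 + 8 + 4) / 1e6              # x, y, N doubles plus integer r: roughly 31.5 MB

The remaining few megabytes of the long data frame are presumably row names and other data frame overhead.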

Conclusion

Although I initially thought of it as inefficient and a waste of memory, I am now quite happy with the long data format for most tasks. It makes working with the data and its associated metadata much easier, and it seems to be quite efficient in R.