Fit Range Determination with Machine Learning

One of the most tedious and error-prone tasks in my work in Lattice QCD is the manual choice of fit ranges. While reading up on Keras, deep neural networks and machine learning, and seeing how experimental the whole field is, I thought I would just try fit range selection with deep learning.

We have correlation functions $C(t)$ which behave as $\sum_n A_n \exp(-E_n t)$ plus noise. The $E_n$ are the energies of the states $n$, the $A_n$ are the respective amplitudes. We are interested in extracting the smallest of the $E_n$, the ground state energy. We use the fact that for sufficiently large times $t$ the term with the smallest energy dominates the sum. Without loss of generality we say $E_0 < E_1 < \ldots$ and write the large-time behavior as $$ C(t) \xrightarrow{t \to \infty} A_0 \exp(-E_0 t) \,. $$

By taking the effective mass as defined by $$ m_\text{eff}(t) = - \log\left(\frac{C(t+1)}{C(t)}\right) $$ we get $m_\text{eff}(t) \sim E_0$ in the region of large $t$. There are more subtleties involved (backward-propagating states, thermal states), which we will ignore here. The effective mass is expected to be constant in some region of the data where $t$ is sufficiently large such that the higher states have decayed, yet the exponentially decaying signal-to-noise ratio is still sufficiently good. An example of such an effective mass from the actual data is shown below.
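To make the definition concrete, here is a minimal NumPy sketch with a synthetic two-state correlator (the energies, amplitudes and noise level are made up purely for illustration, not taken from the actual analysis):

import numpy as np

# Synthetic correlator with a ground state (E_0 = 0.15) and one excited state, plus noise.
t = np.arange(32)
corr = 1.5 * np.exp(-0.15 * t) + 0.8 * np.exp(-0.60 * t)
corr += np.random.normal(0.0, 1e-4 * np.exp(-0.10 * t))

# Effective mass as defined above; it approaches E_0 for large t.
m_eff = -np.log(corr[1:] / corr[:-1])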

Fitting a constant to the effective mass allows us to extract the energy of the ground state, $E_0$. In the above image one can see such a manually chosen fit range. It starts after the excited states that come in from above have decayed and stops before the noise takes over. Such a plateau must have all data points statistically compatible with the fitted value; the fluctuations shall be $\chi^2$ distributed. This is a fancy way of saying that most points should lie within one error bar of the fitted line, some within two error bars, and only very few within three error bars or more.

For my dissertation I have to determine around 500 of these ranges, and it gets boring rather quickly. Especially after every change in the data, this needs to be re-done. So perhaps, after doing a few hundred of them, I could train a neural network to do this work for me? Already at this point I know that even if I found such a solution to this problem, it would need a lot of vetting from my peers before it would be considered credible. Therefore I will still need to verify all the fit ranges by hand. Still, I find it an interesting side project to look at.

For this project I again use a Jupyter Notebook, which is just as great a platform for Python as R Markdown is for R. In both languages I can recommend it over working with a plain script file.

Transferring the data

I have all my analysis data in R. Machine learning with Keras is done in Python. So I have used RcppCNPy to export my data from R into the NumPy format. There is a limitation that only 1D and 2D data structures can be exported. Also one needs to keep in mind that R has the column-major layout that FORTRAN uses whereas NumPy uses the row-major layout of C. I transpose the tensor in R using aperm before storing it with npySave.

From my analysis I have a lot of things available, but the neural network can likely only look at the effective mass. The actual correlator varies over several orders of magnitude, and from what I read, neural networks prefer input data that is somewhat normally distributed. I also need to export the uncertainties of each point. The central values may fluctuate around the plateau within their errors, so just looking at the central values is not enough. I need both as input, although I am not sure how values and errors should be fed into the neural network.

I export the data for a particular ensemble only. This might be a problem for generalization to other ensembles, but then the lattice spacing and pion mass would also need to be input to the neural network; I want to keep it simple. On the cA2.60.32 ensemble we always have a time extent of $T = 64$, such that half of it (the correlator is symmetric and therefore redundant) gives 32 time slices. The resulting data tensor will be of shape $(N, 32, 2)$ for $N = 142$ measurements: 32 time slices and the two features (value, error).
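On the Python side the exported arrays can then be loaded and stacked into this shape. Since RcppCNPy only handles 1D and 2D arrays, I assume here that the values and errors were written as two separate 2D files; the file names are placeholders, and the shape check guards against a wrong orientation:

import numpy as np

# Hypothetical file names for the two exported 2D arrays (after the aperm in R).
values = np.load('effmass_values.npy')
errors = np.load('effmass_errors.npy')

# Stack into the (N, 32, 2) tensor: time slices along axis 1, (value, error) as features.
data = np.stack([values, errors], axis=-1)
assert data.shape == (142, 32, 2)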

In R I have the transposed structure with shape $(2, 32, N)$. To make sure that I have the data correctly transferred, I make a plot of the 7th correlator in R:

# Plot central values (first feature) and errors (second feature) of the 7th correlator.
hadron::plotwitherror(
    x = 1:length(all_data[1, , 7]),
    y = all_data[1, , 7],
    dy = all_data[2, , 7])

And then I do the same thing with my NumPy data structure. Keep in mind that R is 1-indexed and Python is 0-indexed.

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.errorbar(
    np.arange(1, 33),
    data[6, :, 0],
    data[6, :, 1],
    marker='o',
    linestyle='none')

This gives me the same looking plot and I am confident that I have the data just the way that I want it.

The fit ranges (target values) are easier: I have a tensor of shape $(N, 2)$ which contains the beginning and the end of the fit range as integers.

Choosing the network model

As the correlator data that I analyze is a time series, there are two options that I have already seen covered in Chollet's book:

  1. A recurrent neural network (RNN), made with LSTM or GRU layers.

  2. A convolutional neural network (convnet), made with convolutional and pooling layers.

I think that we do not really need that much global information; we want to check locally for a plateau. So we will first start with a convolutional layer. Perhaps later we will try the recurrent neural network as well. Luckily Keras is so easy to work with that one can just exchange the building blocks and train the network again.

Encoding the target data

Then we need to figure out a way to encode the target data. Just having two integers is likely not going to work very well. If we were to target only a single integer, we would use a one-hot encoding for the numbers, a softmax activation function and categorical crossentropy as the loss function. We have two integers, so perhaps we need a non-sequential network graph to generate two one-hot encoded outputs.
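Such a non-sequential graph could be built with the Keras functional API, roughly like in the following sketch (just to illustrate the idea; it is not the model that I end up using below):

inputs = keras.Input(shape=(32, 2))
hidden = keras.layers.Flatten()(inputs)
hidden = keras.layers.Dense(128, activation='relu')(hidden)

# Two separate one-hot outputs over the 32 time slices, one for the start and one for the end.
start = keras.layers.Dense(32, activation='softmax', name='tmin')(hidden)
end = keras.layers.Dense(32, activation='softmax', name='tmax')(hidden)

model = keras.models.Model(inputs, [start, end])
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])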

Marking the plateau

An alternative would be to mark the plateau region, setting it to all 1's and everything around it to all 0's. The neural network would then basically give, for every single time slice, the chance that the point belongs to the plateau.

This is easily generated from the given data. One just has to be careful that the hadron fit routine takes tmin and tmax to be inclusive-inclusive, whereas Python slicing is inclusive-exclusive. Also, Python array indices are 0-based, so only tmin needs an adjustment; for tmax the two effects cancel.

target = np.zeros((n_meas, 32))
for i in range(n_meas):
    tmin, tmax = labels[i, :]
    # Shift tmin to a 0-based index; the exclusive slice end absorbs the shift for tmax.
    target[i, (tmin-1):tmax] = 1

This encoding then looks like the following with ax.imshow(target):

The training process also needs a loss function and a metric to judge the success. Looking at the documentation for the losses we can see that there are a bunch of them. The categorical crossentropy is not applicable here, so we just try the mean absolute error, which is defined as mean(abs(y_true - y_pred)). We therefore get a penalty for every point that is marked as part of the plateau but should not be, and vice versa.

For the activation I am not sure what to use, so we will just go with a sigmoid function to push the output towards either 0 or 1.

In order to measure success in the end I use the false positive and false negative metrics. This way we can see how many of the $142 \times 32 = 4544$ output elements were computed incorrectly and in which direction the errors are biased.

One problem with this approach is certainly that the predicted fit range does not need to come out consecutive. Holes in the fit range could be represented with this encoding, but we do not want to allow them.

One-hot encoding start and end

An alternative approach would be to use one-hot encoding for the start and also for the end. The softmax transformation has an axis=-1 default argument, which means that it is applied along that axis only, so we can have both start and end in the same result tensor.

The encoding is straightforward.

target = np.zeros((n_meas, 2, 32))
for i in range(n_meas):
    tmin, tmax = labels[i, :]
    # One-hot encode the start (row 0) and the end (row 1), shifted to 0-based indices.
    target[i, 0, tmin-1] = 1
    target[i, 1, tmax-1] = 1

And the result is just as expected; here only the first fit range is shown:

For the loss we can use the categorical crossentropy, and the metric will be accuracy. This then tells us how many starts and ends have been determined correctly.

Dense approach

Before trying anything more fancy, we can just go ahead with a simple dense model. Chollet writes that one should start with the simplest model and only justify the expense of more complex models by the simple ones not performing well.

Using marked plateau

The simple dense model that we will try first is defined as such:

network0 = keras.models.Sequential()
network0.add(keras.layers.Dense(128, activation='relu', input_shape=(32, 2)))
network0.add(keras.layers.Flatten())
network0.add(keras.layers.Dense(32, activation='sigmoid'))

network0.compile(optimizer='rmsprop',
                 loss='mean_absolute_error',
                 metrics=[keras.metrics.FalsePositives(),
                          keras.metrics.FalseNegatives()])

The model therefore looks like this after compilation:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_27 (Dense)             (None, 32, 128)           384       
_________________________________________________________________
flatten_16 (Flatten)         (None, 4096)              0         
_________________________________________________________________
dense_28 (Dense)             (None, 32)                131104    
=================================================================
Total params: 131,488
Trainable params: 131,488
Non-trainable params: 0

We then train the network with these options:

history = network0.fit(
    data, target,
    epochs=300,
    batch_size=10,
    validation_split=0.2)
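The training curves shown below are made from the returned history object, roughly like this (a sketch assuming matplotlib):

import matplotlib.pyplot as plt

# Training and validation loss per epoch; the metrics are stored under similar keys.
fig, ax = plt.subplots()
ax.plot(history.history['loss'], label='training loss')
ax.plot(history.history['val_loss'], label='validation loss')
ax.set_xlabel('epoch')
ax.set_ylabel('mean absolute error')
ax.legend()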

The loss and metric look like this over the epochs:

A mean absolute error of 0.12 means that about 12 % of the time slice results are incorrect, since the target per slice is either 0 or 1 and the absolute error per slice is therefore roughly 0 or 1. Looking at the rates of false positives and false negatives, we see roughly 8 % false positives and 4 % false negatives.

The encoding of the plateaus shows us that the network has not really learned that much about the data, but rather predicts pretty much the same range for most data sets.

Taking the difference between prediction and target shows that there are many mistakes and that this model is not that great.
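Such a difference plot can be produced by thresholding the sigmoid outputs, for instance like this (a sketch; the 0.5 threshold is my own choice here):

# Binarize the network output and compare with the target encoding.
pred = (network0.predict(data) > 0.5).astype(int)

# 0 where prediction and target agree, +1/-1 where time slices are wrongly marked.
fig, ax = plt.subplots()
ax.imshow(pred - target)

# Overall fraction of incorrectly classified time slices.
print(np.mean(pred != target))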

We are not overfitting, so perhaps one should just give the model more freedom? No, it does not seem to get any better than the 12 % error rate.

That is the baseline that we would have to beat.

Using one-hot start and end

We can also try this model with the other encoding of the target data. I am not quite sure how that works exactly with the activation, because the dense layers here produce a flat vector rather than the $(2, 32)$ shape. So I reshape and only then apply the softmax activation.

network0 = keras.models.Sequential()
network0.add(keras.layers.Dense(128, activation='relu', input_shape=(32, 2)))
network0.add(keras.layers.Flatten())
network0.add(keras.layers.Dense(64, activation='relu'))
# Reshape the 64 flat outputs into (start, end) x 32 and apply the softmax along the last axis.
network0.add(keras.layers.Reshape((2, 32)))
network0.add(keras.layers.Activation('softmax'))

network0.compile(optimizer='rmsprop',
                 loss='categorical_crossentropy',
                 metrics=['accuracy'])

The results are devastating. It starts to overfit pretty much right from the start:

When just looking at the start of the fit range, it does not look appealing either.

In the difference plot one can see that the start of the fit range is off by a few elements.

Given that 80 % of the data has been used for training and that it is overfitting, this does not look too good. One could try to regularize this model to make the overfitting less pronounced, but I fear that this won't make it any better.

Convolutional approach

The convolutional layer can combine information from the local neighborhood. This makes a lot of sense for finding a plateau, because it should identify parts where the central values have no trend (linear coefficient) and also no curvature (quadratic coefficient).

We also need to somehow make it use the uncertainties as well as the central values. The central values $m_\text{eff}(t)$ may vary within their errors $\Delta m_\text{eff}(t)$, but not much more; basically $m_\text{eff}(t) \pm \Delta m_\text{eff}(t)$ is the corridor in which they may vary. With a 2D convolutional layer the neural network might be able to pick up this information somehow and massage it into features like “constant within errors” and “upwards/downwards trend within errors”.

The target encoding using 1's in the plateau region and 0's elsewhere seems to make sense here.
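One technicality: the Conv2D layer below expects a trailing channel axis, so the data tensor presumably has to get an extra axis first, turning $(N, 32, 2)$ into $(N, 32, 2, 1)$ (the name data_conv here is my own):

# Add a channel axis so the input matches the Conv2D input_shape of (32, 2, 1).
data_conv = data[..., np.newaxis]
print(data_conv.shape)  # (142, 32, 2, 1)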

The network that I have chosen is the following:

network1 = keras.models.Sequential()
network1.add(keras.layers.Conv2D(64, (3, 2), activation='relu', input_shape=(32, 2, 1)))
#network1.add(keras.layers.Reshape((30, 32)))
#network1.add(keras.layers.MaxPooling1D((2,)))
#network1.add(keras.layers.Conv1D(64, (3,), activation='relu'))
#network1.add(keras.layers.MaxPooling1D((2,)))
#network1.add(keras.layers.Conv1D(64, (3,), activation='relu'))
#network1.add(keras.layers.MaxPooling1D((2,)))
network1.add(keras.layers.Flatten())
network1.add(keras.layers.Dropout(0.3))
network1.add(keras.layers.Dense(128, activation='relu'))
network1.add(keras.layers.Dense(32, activation='sigmoid'))

print(network1.summary())

network1.compile(optimizer='rmsprop',
                 loss='mean_absolute_error',
                 metrics=[keras.metrics.FalsePositives(),
                          keras.metrics.FalseNegatives()])

It starts with a convolutional layer that uses a 3×2 stencil to pick up the error from the other feature dimension. This way it should be able to build stencils that resolve a trend within errors. As there are only linear transformations, it likely cannot do a $t$-test, so we might see limitations.

I let it go directly into a dense classification network in the hope that this would become a somewhat diagonal map and pick out the applicable stencils that the convolution has learned.

Keras provides the following summary of the model:

Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_39 (Conv2D)           (None, 30, 1, 64)         448       
_________________________________________________________________
flatten_39 (Flatten)         (None, 1920)              0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 1920)              0         
_________________________________________________________________
dense_70 (Dense)             (None, 128)               245888    
_________________________________________________________________
dense_71 (Dense)             (None, 32)                4128      
=================================================================
Total params: 250,464
Trainable params: 250,464
Non-trainable params: 0

The results are slightly worse than with the pure dense network.

From the target-versus-prediction plot I would even say that it shows less individuality for each measurement and treats them mostly the same.

And in the difference plot it also looks disheartening.

Adding the additional blocks made of convolutional and pooling layers (commented out above) does not improve anything. This does not surprise me, as for this problem we do not really need more complicated global features (like with image classification) but rather need to retain the spatial information.

Conclusion

I have now tried a few different models and parameterizations of the data. It feels as if either there is insufficient data to solve this problem in a satisfactory fashion, or I am not experienced enough to find a good neural network for it. I have not tried the recurrent layers yet; perhaps they will not work that well either.

Last year we discussed this exact problem with machine learning specialists and they deemed it a hard problem. If a simple dense or convolutional network had been the answer, they likely would have suggested it. Therefore I am happy to have played around with it, but I am also willing to just leave it at this for now.

Even if this reproduced exactly the fit ranges that I have chosen, it would be unclear how the systematic error from choosing the fit range would be treated. The neural network cannot really explain its reasoning like a human could at least try to, so one would be stuck with a sort of black box in the analysis chain.