Number Sequence Questions Tried with Deep Learning
As part of IQ tests there are these horrible number sequence tests. I hate them with a passion because they are mathematically illdefined problems. A super simple one would be to take 1, 3, 5, 7, 9 and ask for the next number. One could find this very easy and say that this sequence are the odd numbers and therefore the next number should be 11. But searching at the The OnLine Encyclopedia of Integer Sequences (OEIS) for that exact sequence gives 521 different results! Here are the first ten of them:
Sequence  Prediction 

The odd numbers: $a(n) = 2n + 1$.  11 
Binary palindromes: numbers whose binary expansion is palindromic.  15 
Josephus problem: $a(2n) = 2a(n)1, a(2n+1) = 2a(n)+1$.  11 
Numerators in canonical bijection from positive integers to positive rationals ≤ 1  11 
a(n) = largest base2 palindrome m <= 2n+1 such that every base2 digit of m is <= the corresponding digit of 2n+1; m is written in base 10.  9 
Fractalization of (1 + floor(n/2))  8 or larger 
Self numbers or Colombian numbers (numbers that are not of the form m + sum of digits of m for any m)  20 
Numbers that are palindromic in bases 2 and 10.  33 
Numbers that contain odd digits only.  11 
Number of nth generation triangles in the tiling of the hyperbolic plane by triangles with angles  12 
So there must be an additional hidden constrain in the problem statement. Somehow they want that the person finds the simplest sequence that explains the series and then use that to predict the next number. But nobody ever defined what “simple” means in this context. If one would have a formal definition of the allowed sequence patterns, then these problems would be solvable. As they stand, I deem these problems utterly pointless.
Since I am exploring machine learning with Keras, I wondered whether one could solve this class of problems using these techniques. First I would have to aquire a bunch of these sequence patterns, then generate a bunch of training data and eventually try to train different networks with them. Finally I'd evaluate how good it performs.
Using web search I came up with a bunch of different websites that have these sequence tests. The first one is fibonicci.com, where they also have this very insightful statement:
Lastly the hard test contains 21 difficult questions and beware they are known as some of the hardest number sequences on the internet. Can you solve them all? Good luck! — fibonicci.com
To me this already shows that this is bullshit. What is the metric for “hard” here? We could just take any sequence from OEIS which is super hard to compute and make that a question in an IQ test. I presume that it would not take much to select a series which does not really make sense.
Let me directly make up the hardest sequence ever: 1, 1, 1, 1, 1. What is the next number? Well, it could be just 1, but also 2 or even 3, depending on which rule you take. I would offer 1, 2 and 3 as possible answers. Which one would you pick? If I just picked one of these sequences (say the one continuing with 2) and sayed that 1 and 3 where wrong, this would be called arbitrary. But to me, all these questions are nothing else.
Sampling possible sequences
Easy category
This website gives a few example in the “easy” category. I have marked the next number in parentheses.
 2 4 9 11 16
 30 28 25 21 16
 972 324 108 36 12
 16 22 34 52 76
 123 135 148 160 173
 0.3 0.5 0.8 1.2 1.7
 4 5 7 11 19
 1 2 10 20 100
Let's take a look at these and try some systematics. What are the differences between the numbers?
> x < c(2, 4, 9, 11, 16)
> diff(x)
[1] 2 5 2 5
It seems that they alternate between 2 and 5, so that is easy. The next difference will be a 2, so we just have to add 2 and the result is 18. Then the next one will be five larger, so 23. We can also find this OEIS as “Numbers that are congruent to {2, 4} mod 7”. Using this information we can also create that series ourselves:
> s < seq(1, 100)
> s[s %% 7 %in% c(2, 4)]
[1] 2 4 9 11 16 18 23 25 30 32 37 39 44 46 51 53 58 60 65
[20] 67 72 74 79 81 86 88 93 95 100
Then we take a look at the next one, also taking the differences again:
> x < c(30, 28, 25, 21, 16)
> diff(x)
[1] 2 3 4 5
> diff(diff(x))
[1] 1 1 1
And so differences are decreasing integers, so the next number is six smaller than 16, therefore the answer is 10.
OEIS does not find it, but suggests that the sequence is $$ a_n = − \frac12 n^2 − \frac12 n + 31 \,. $$ As the differences linearly decrease, it is not surprising that the resulting sequence follows a quadratic prescription.
And on to the next one. First we compute the differences.
> x < c(972, 324, 108, 36, 12)
> diff(x)
[1] 1296 432 144 48
That does not seem to help so much as there is a sign flip. So perhaps we should take the differences of the absolute values instead.
> diff(abs(x))
[1] 648 216 72 24
Nope, that does not help either. Let's try to to take the ratio of successive items.
> x / dplyr::lead(x)
[1] 3 3 3 3 NA
Yes, so we have to divide by $3$ to get to the next item. Therefore 4 will be the next one.
For the next one, we need to take the differences of the differences in order to get something useful:
> x < c(16, 22, 34, 52, 76)
> diff(x)
[1] 6 12 18 24
> diff(diff(x))
[1] 6 6 6
The last difference is 24, so the next difference will be 30. And therefore the next number will be 106. OEIS unsurprisingly gives us the prescription $a_n = 3 n^2 − 3 n + 16$ where we can see that the second derivative is a constant 6.
The next in the list is a really boring one, with the differences alternating between 12 and 13:
> x < c(123, 135, 148, 160, 173)
> diff(x)
[1] 12 13 12 13
And then we have another one where the second derivative is constant:
> x < c(0.3, 0.5, 0.8, 1.2, 1.7)
> diff(x)
[1] 0.2 0.3 0.4 0.5
> diff(diff(x))
[1] 0.1 0.1 0.1
Finally we are going to see a new pattern in the next one, it seems:
> x < c(4, 5, 7, 11, 19)
> diff(x)
[1] 1 2 4 8
It is possible that the derivative is just $2^n$, such that the next difference is 16 and therefore we have $19 + 16 = 35$. This is a new pattern that we haven't seen before.
And the next one is also a new pattern.
> x < c(1, 2, 10, 20, 100)
> diff(x)
[1] 1 8 10 80
Here we can either look at the numbers or the derivative. It seems to be that it is just the digits 1 and 2 and then additional zeros added. We could also interpret this as a multiplicative thing where the ratio from one to the next alternates between 2 and 5.
> dplyr::lead(x) / x
[1] 2 5 2 5 NA
Medium category
We could go on and also take a look at the harder problems. In the “medium” category we find more series:
 2 5 4 3 6
 1 4 9 16 25
 75 15 25 5 15
 1 2 6 24 120
 183 305 527 749 961
 16 22 34 58 106
 17 40 61 80 97
For the first one I would think that there are two series zipped together. Both decrease by 2 in each step, the first was started at $2$ and the second at 5. The next number should be a 1, then.
The second are just the square numbers, $n^2$.
In the third one I am pretty lost. The first and second derivative do not give much information here.
> x < c(75, 15, 25, 5, 15)
> diff(x)
[1] 60 10 20 10
> diff(diff(x))
[1] 70 30 30
Also taking the ratios of successive elements does not give much insight at first.
> dplyr::lead(x) / x
[1] 0.200000 1.666667 0.200000 3.000000 NA
Perhaps we have two partial series here that are independent of each other?
> odd < x[seq(1, 5, by = 2)]
[1] 75 25 15
> even < x[seq(2, 5, by = 2)]
[1] 15 5
This does not look really sensible. The possible answers are 7, 3, 5 and 2. Since all the elements so far end with a 5, the likely answer would also be a five. But it is not the correct element. Actually 3 is the correct answer.
Just plotting the data with the additional point gives a pattern that looks like this:
We can see that there are jumps up an down. The jumps up are always the same height (10 more) and the jumps down are getting smaller. So we can combine the information and say that it gets divided by 5 on the even steps and increased by 10 on the odd steps.
The sequence 1, 2, 6, 24, 120 are the factorials with the next element being 720. We can see that by taking the inverse ratio and then the derivative of that.
> x < c(1, 2, 6, 24, 120)
> dplyr::lead(x) / x
[1] 2 3 4 5 NA
> diff(dplyr::lead(x) / x)
[1] 1 1 1 NA
The next one seems to have a rather strange derivative. The inverse ratio does not help much either.
> x < c(183, 305, 527, 749, 961)
> diff(x)
[1] 122 222 222 212
> dplyr::lead(x) / x
[1] 1.666667 1.727869 1.421252 1.283044 NA
Is this a play of digits in the differences there? Actually it is a play of digits in the numbers itself. We can see that the first digits are increasing odd numbers. The second digits are increasing even numbers and the last digits are also increasing odd numbers. They seem to be cyclic. By this approach we will just need to cyclicly increase all the digits by two and end up with 183 again. That is the correct answer. But I have cheated and looked at the answer before figuring out that this was the pattern.
Identified patterns
From the sequences sampled in the above section, we can extract the following list of patterns that occur.
 Differences alternate between two numbers ({2, 5} or {12, 13})
 Difference decreases by one
 Divide by $3$
 Second derivative constant (1 or 6)
 Derivative is $2^n$
 Multiply by alternatingly 2 and 5
 Alternatingly divide by 5 and increase by 10
 Factorial series, multiply by increasing numbers
 Treating the digits as separate series with modular arithmetic
The website jobtestprep.co.uk has an article about these series and gives patterns that are just like the ones that we have identified from the examples.
There likely are even more patterns that one could identify in other examples. As they are mostly arbitrary and whatever the test maker thought to be an intuitive and obvious pattern, we can just take a few of them and see whether we can get a machine learning algorithm to work on them.
Machine learning approaches
It would be rather straightforward to write a program using conventional methods. I would take the first and second derivative, look whether there are evenodd patterns to be detected, take ratios and inverse ratios to find integer numbers in them. This would solve most of the ones already. One would need to come up with a decision tree to look for the patterns. But that would be boring.
Fundamentally I see two different approaches to this problem.

One could try to to predict the next number directly by letting the system learn to match the first five elements to the sixth one. In testing one would just give it a bunch of five element sequences and check whether it produces the desired answer.

Categorize the sequences according to the patterns that may occur. This will be a finite classification problem then. In order to predict the next element one would have to also implement the rules to produce the next number. Having a separate category for increase by one and increase by two would quickly make the space of sequences uncontrollably large. Instead one should rather only have the category and then the prescription would still need to compute the derivative or ratio.
Next one needs to figure out how the data should be encoded. The options would be to just have the data as is, scaled down to floating point numbers in the interval $[1, 1]$. Also one could use the onehot encoding for the numbers, needing a lot of dimensions to make it work properly. I will restrict the series to a range from like 0 to 100 at first such that it does not have such a hard time to get it. I want to start with just having the data as is.
Raw data, plain linear
To start with, I will just use the simplest type of series that I can imagine.
sample_count = 10000
rng = np.random.default_rng()
increase = np.arange(0, 6)
starts = rng.integers(1, 10, sample_count)
slopes = rng.integers(1, 10, sample_count)
sequences = np.atleast_2d(starts).T + np.atleast_2d(slopes).T * np.atleast_2d(increase)
normalization = np.max(sequences)
sequences = sequences / normalization
Dense network
model = keras.models.Sequential()
model.add(keras.layers.Dense(32, input_shape=(5,)))
model.add(keras.layers.Dense(1))
model.compile(optimizer='rmsprop',
loss='mean_absolute_error',
metrics=['mean_squared_error'])
Layer (type) Output Shape Param #
=================================================================
dense_22 (Dense) (None, 32) 192
_________________________________________________________________
dense_23 (Dense) (None, 1) 33
=================================================================
Total params: 225
Trainable params: 225
history = model.fit(
data, target,
epochs=30,
batch_size=20,
validation_split=0.2)
The predictions that come out of the network would need to be rounded to the next integer to make the prediction. I check how many of the deviations have an absolute value of greater than 0.5, which would mean a wrong prediction. That rate is 4.3 %, so not too bad for a simple dense network with 32 features, but also not that good given that the problem could be solved so easily with conventional methods.
Making that 512 features does not help at all. This is the loss for that case:
And the error rate is 76.2 %, so that clearly does not help at all.
Convolution
Next I want to try a convolutional layer. In this example at most two successive elements are connected, therefore a consolution should be sufficient. Something like a GRU or LSTM layer seems overkill. And I know that there will be one pattern which is present globally, so I want to use a global average pooling layer. I know that there are only 9 different slopes put into the data, so I should be able to get away with 16 different filters in the convolutional layer.
model = keras.models.Sequential()
model.add(keras.layers.Conv1D(16, (3,), input_shape=(5, 1)))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(1))
This model seems to perform very well on the data, with only 1.5 % error rate. The mean absolute error seems to be rather stable at 0.5, which is just the acceptable threshold.
Raw data, quadratic forms
I now start to add some quadratic forms to the mix using the following generating code:
starts = rng.integers(0, 10, sample_count)
slopes = rng.integers(0, 10, sample_count)
curvatures = rng.integers(0, 5, sample_count)
sequences = np.atleast_2d(starts).T \
+ np.atleast_2d(slopes).T * np.atleast_2d(increase) \
+ np.atleast_2d(curvatures).T * np.atleast_2d(increase)**2
This way the system needs to learn a curvature. Sadly, this seems to be hard and the error rate is at 89.4 % with a high loss:
I would have thought that a convolution could learn to figure out a slope. With the appropriate kernel it can learn to measure slope: $a_n  a_{n1}$ would be a suitable one. It could also use a kernel like $a_{n1}  2 a_n + a_{n+1}$ to measure curvature. Perhaps the problem was that I have only allowed 16 filters to be measured.
Allowing for 64 filters does not really make it much better, this brings us to 83.9 %.
Perhaps convolutions of length 3 are not sufficient. So I make it try a length of 5. The pooling layer should become useless now. The error rate is still 83.2 %, so that did not help.
So I increase to 512 filters with length 5, perhaps that does the trick. We are now at 99.0 % error rate. The network seems to overfit, although the validation loss looks reasonable close to the training loss.
We have seen this pattern before as well, too many convolutional filters will make the result worse. Perhaps one could improve by training for more epochs, but a simpler model gave better results with less resources.
Conclusions
It seems that just applying a convolution to the data is not sufficient to extract all the interesting features. This surprises me a bit because I would have expected that a convolution could effectively take the first and second derivative of data. The dense network at the end must be able to build the sum of the last data point and first and second derivative to extrapolate the next point. Perhaps the problem is that the convolution does not preserve the last element and one would have to reinject the original information into the final state to have it available via shortcircuit.
Maybe even with those changes the convolution just does not work. One would need to extract features like the first and second derivative, the ratios and other things beforehand. Some of these are nonlinear transformations, the network may not be able to learn these just from the data and one has to provide some features manually.
It could also be that deep learning is not the right approach here and as it should rather be a decision tree from more complicated features, one should rather persue an avenue like that.