Martin Ueding (Posts about Machine Learning)https://martin-ueding.de/enContents © 2020 <a href="mailto:mu@martin-ueding.de">Martin Ueding</a>
<p><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /></a> This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</p>
Fri, 30 Oct 2020 13:34:00 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rss- Clustering Recorded Routeshttps://martin-ueding.de/posts/clustering-recorded-routes/Martin Ueding<div><p>I record a bunch of my activities with Strava. Some are novel routes that I try out and only do once; the others are routes that I do more than once. The thing that I am missing on Strava is a comparison of similar routes. It has segments, but I would have to make my whole commute one segment in order to see how I fare on it.</p>
<p>So what I would like to try here is to use a clustering algorithm to automatically identify clusters of similar rides. I would also like to find rides that have the same start and end point, but different routes in between. My machine learning book says that there are clustering algorithms, so this is the project that I would like to apply them to.</p>
<p>Incidentally, Strava <a href="https://www.strava.com/apps">features a lot of apps</a>, so I had a look but could not find what I was looking for. Instead I want to program this myself in Python. One can export the data from Strava and obtain a ZIP file with all the GPX files corresponding to the activities.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/export_zip.png"></p>
<!-- END_TEASER -->
<h2 id="reading-the-data">Reading the data</h2>
<p>The activities are just numbered, so one needs to look up the metadata in <code>activities.csv</code>, which looks like this:</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/activities_csv.png"></p>
<p>I might resolve that at some point, but at first the labeling of the data is irrelevant.</p>
<p>The coordinates given in the GPX data are latitude and longitude. I always get them confused, as I rather think in North-South and West-East, or the spherical coordinates $\theta$ and $\phi$. So the coordinates that we have here are these:</p>
<table>
<thead>
<tr>
<th>Term</th>
<th>Symbol</th>
<th>Direction</th>
</tr>
</thead>
<tbody>
<tr>
<td>Latitude</td>
<td>$\phi$</td>
<td>North-South</td>
</tr>
<tr>
<td>Longitude</td>
<td>$\lambda$</td>
<td>West-East</td>
</tr>
</tbody>
</table>
<p>When one knows the earth radius, one can provide an approximate mapping to $x$, $y$ and possibly $z$ coordinates, assuming a fixed radius of the earth. But as I am only moving around in a rather localized environment, I can just approximate the surface with a tangent plane and take latitude and longitude directly as coordinates. This is effectively an <a href="https://en.wikipedia.org/wiki/Equirectangular_projection">equirectangular projection</a>: distances are not exactly what they seem and need to be corrected with a cosine factor depending on the latitude. For comparing points with each other, this is sufficient.</p>
<p>As all the machine learning libraries are implemented in Python, I need to load the data into Python. There is the <a href="https://github.com/tkrajina/gpxpy">gpxpy library</a> which can read the GPX files from Strava. It provides a nice list of points, and that is all I need.</p>
<h2 id="clustering-by-start-or-end-points">Clustering by start or end points</h2>
<p>At first I have only loaded 10 activities and plotted latitude against longitude, not worrying about the non-distance-preserving projection.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/first_plot.png"></p>
<p>I then zoom in on the region with what I deem the largest cluster, which likely is the Bonn cluster.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/first_plot_zoomed.png"></p>
<p>My machine learning book has a section on clustering algorithms. I will first try K-Means and then DBSCAN to see how they perform on this problem.</p>
<h3 id="k-means">K-Means</h3>
<p>The first clustering algorithm that I read about is K-Means. It is implemented in SciKit-Learn, but it has the drawback that one needs to know the number of clusters beforehand. I take the above subset of data and let it cluster into 5 clusters, as that is how many I can see by eye. I color them, and it works just as one would expect.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/subset_clustered.png"></p>
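<p>The K-Means step itself is only a few lines with SciKit-Learn. This is a sketch with made-up stand-in blobs instead of the real start points:</p>

```python
# Cluster (latitude, longitude) start points into a fixed number of clusters.
# The five random blobs are stand-ins for the real Strava start points.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = rng.uniform(-10.0, 10.0, size=(5, 2))
starts = np.concatenate([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(starts)  # one cluster index per activity
```

<p>The <code>labels</code> array can be passed directly as the color argument of a scatter plot to reproduce the picture above.</p>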
<p>In the meantime I have loaded the full data set, which looks like this:</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/all_plot.png"></p>
<p>One can see the big cluster around Bonn; then there are clusters in the Netherlands, around København (Denmark), in Spain and in Italy. The two clusters in China can also be seen. What happens now when I let it find five clusters? The clusters are (0) Germany and the Netherlands, (1) Beijing, (2) Spain and Italy, (3) København and (4) Wuhan.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/all_clusters5.png"></p>
<p>That's nice, but I would like to find clusters on a fixed scale, say 200 m, and not just the top $n$ clusters. But let us explore this a bit more. Letting it find 10 clusters, one can see that Spain (green) and Italy (gray) have been split up. Also the Netherlands (pink) were split off. And the main cluster shows that there is a chunk around Bonn and then more in other cities north of it.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/clusters10.png"></p>
<p>Zooming into the Bonn region again, we can see that it is still just one cluster. And although I am not perfectly sure, I think that I can make out a few points of interest just from the clusters.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/clusters10_Bonn.png"></p>
<h3 id="dbscan">DBSCAN</h3>
<p>The DBSCAN algorithm defines clusters in a much more fitting way. Instead of trying to find $k$ clusters, it uses a measure of distance $\epsilon$ and a minimum number of samples $n$ to define <em>core instances</em>. A core instance is a data point which has at least $n$ other instances within a distance of $\epsilon$. Core instances whose neighborhoods overlap are connected into the same cluster.</p>
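<p>In SciKit-Learn the two parameters are called <code>eps</code> and <code>min_samples</code>, and points belonging to no cluster get the special label −1. A minimal sketch on stand-in coordinates:</p>

```python
# DBSCAN on (latitude, longitude) points: two tight groups and one lone point.
# The coordinates are stand-ins for the real start and end points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
points = np.concatenate([
    np.array([50.7, 7.1]) + 0.01 * rng.standard_normal((10, 2)),   # Bonn-like blob
    np.array([55.7, 12.6]) + 0.01 * rng.standard_normal((10, 2)),  # København-like blob
    np.array([[53.3, -6.3]]),  # a single recording, becomes an anomaly
])

db = DBSCAN(eps=1.0, min_samples=3).fit(points)
labels = db.labels_  # one cluster index per point, -1 marks anomalies
```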
<p>I have first tried $\epsilon = 1.0$ and $n = 3$, as I was not sure how the scale works. I presume that it uses the Euclidean distance, $\sqrt{\Delta\phi^2 + \Delta\lambda^2} \leq \epsilon$, but I am not sure. In that case a whole degree is super coarse-grained. But it works much better than before. See how there are three distinct regions in the main clustering region now. The algorithm also marks points as anomalies when they have no neighbors. I now actually think that I misinterpreted the points before. The single anomaly in the north is Ireland, the red points are Utrecht and Holland, the brown one is Groningen. The one in Italy is an anomaly (because I only recorded one activity there), whereas in Spain I have multiple ones.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/dbscan_eps1.png"></p>
<p>This is very promising! I will need to decrease the $\epsilon$ value somehow. Looking at the map around my flat, I feel that $\epsilon = 0.00350$ would correspond to a couple hundred meters. And indeed, using this as the distance threshold I get so many clusters that the legend becomes pointless.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/dbscan_eps0035.png"></p>
<p>Zooming into the Bonn region, I can see that there are more sensible clusters now. The big green one likely is my place of living, and then there are other ones that I cannot exactly make out. It will be interesting to see this combined with a map at some point.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/dbscan_eps0035_Bonn.png"></p>
<p>There are now 23 clusters. One of them comprises 149 observations, and there are 88 anomalies. That sounds like a pretty good start! Perhaps I need to decrease the distance measure a little or increase the number of observations required before something counts as a cluster. Using the same $\epsilon$ but $n = 5$ reduces the number of clusters to 13; the largest cluster still has 149 observations, but there are now 124 anomalies. Perhaps this is even better.</p>
<p>So far I have just thrown the start and end points together. This allows identifying points where I have either started or finished, but it does not really allow classifying actual routes. So I have changed the data from two numbers (latitude and longitude) to four (latitude and longitude for both start and finish). This way it becomes a four-dimensional clustering problem and I cannot visualize it neatly any more.</p>
<p>I can show the starting points of the new clusters. One can see that there are rides where apparently I started at home and ended at home, and ones where I started at home but ended at the other cluster of points. All other points have become anomalies; there are only 7 clusters in the whole data set now.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/dbscan_start_end.png"></p>
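<p>The change to four dimensions is just a different feature matrix. A sketch with hypothetical stand-in tracks:</p>

```python
# Build one (start_lat, start_lon, end_lat, end_lon) row per activity and
# cluster those four-dimensional vectors. The three tracks are stand-ins.
import numpy as np
from sklearn.cluster import DBSCAN


def start_end_features(tracks):
    """Stack start and end coordinates into an (n_tracks, 4) array."""
    return np.array([np.concatenate([track[0], track[-1]]) for track in tracks])


tracks = [
    np.array([[50.70, 7.10], [50.71, 7.11], [50.73, 7.15]]),     # a commute
    np.array([[50.70, 7.10], [50.72, 7.12], [50.73, 7.15]]),     # same endpoints, detour
    np.array([[55.70, 12.60], [55.71, 12.61], [55.72, 12.62]]),  # unrelated walk
]
features = start_end_features(tracks)
labels = DBSCAN(eps=0.0035, min_samples=2).fit_predict(features)
```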
<h2 id="metric-for-tracks">Metric for tracks</h2>
<p>Clustering by the start and end points is neat; I now have, for instance, all commutes to work in one cluster. But what happens with the ones where I made a detour? Sometimes I try slightly different paths on a commute, and I would like to cluster them on a finer level. For this we need to take a look at the whole route, not just the start and end points.</p>
<p>The algorithms need to know the distance between elements. So I need to define a metric on the space of whole tracks. The intuitive way is to just plot a pair of tracks on a map and look at them. The following picture shows two different tracks that were recorded on the same route. There are slight deviations due to GPS inaccuracies, but we would still say that this is the same route.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/two_tracks_color.png"></p>
<p><em>Map from Open Street Map. Track visualized in Viking.</em></p>
<p>My program does not have the pattern recognition machine that is our brain, so there is a need for something that can actually be computed. Zooming in shows the actual points in the track. As a metric, it seems sensible to compute the distance from each point to the other track. Just pairing the points up in order does not work: sometimes I walk slower or faster, and then the distances would be much larger. Also the number of points in the recordings does not always match, and different programs use different recording intervals. One really needs to match the points with each other.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/two_tracks_color_zoom.png"></p>
<p><em>Map from Open Street Map. Track visualized in Viking.</em></p>
<p>For this we need to compute the distance from every point of the green track to the orange track in the form of a distance matrix
$$
D_{ij} = \sqrt{
\cos\left( \frac{\phi_i^{(1)} + \phi_j^{(2)}}{2} \right)^2
(\lambda_i^{(1)} - \lambda_j^{(2)})^2
+ (\phi_i^{(1)} - \phi_j^{(2)})^2
} \,.
$$
The cosine term, evaluated at the mean latitude of the two points, is needed because differences in longitude $\lambda$ correspond to different linear distances depending on the latitude $\phi$. This likely won't pose a big problem, but it is cleaner to include it here.</p>
<p>I take the minimum value from each row, and also from each column. These are the minimum distances of each point of one track to the other track. Then I take the sum of both means to yield the distance. This can be multiplied with the earth radius to give the average distance between corresponding points in the tracks. As a mathematical expression, this would be
$$
g_{1,2} = r_{\mathrm{Earth}} \cdot \left( \mathrm{mean}_i \min_j D_{ij} + \mathrm{mean}_j \min_i D_{ij} \right) \,.
$$</p>
<p>For the given two walks I obtain a distance measure of around 0.6 m, which makes sense. For two random ones I obtain something like 1000 km, which isn't surprising with one being in Denmark and the other one in Bonn.</p>
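<p>A NumPy sketch of this metric (with the cosine taken at the mean latitude of the two points); it assumes that the coordinate arrays have already been converted to radians:</p>

```python
# Distance between two tracks `a` and `b`, each an (N, 2) array of
# (latitude, longitude) in radians, combining the row and column minima
# of the distance matrix and scaling by the earth radius.
import numpy as np

EARTH_RADIUS = 6.371e6  # meters


def track_distance(a, b):
    lat_a, lon_a = a[:, 0][:, np.newaxis], a[:, 1][:, np.newaxis]
    lat_b, lon_b = b[:, 0][np.newaxis, :], b[:, 1][np.newaxis, :]
    # Longitude differences shrink with the cosine of the latitude.
    d = np.sqrt(
        (np.cos((lat_a + lat_b) / 2) * (lon_a - lon_b)) ** 2
        + (lat_a - lat_b) ** 2
    )
    # Sum of the mean row minima and mean column minima, in meters.
    return EARTH_RADIUS * (d.min(axis=1).mean() + d.min(axis=0).mean())
```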
<p>The <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html">documentation of DBSCAN</a> states that one can supply a precomputed distance matrix. This way we can just pass our custom metric $g_{ab}$ that we have defined above.</p>
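<p>A generic sketch of that route, assuming any symmetric distance function between tracks; a toy metric on plain numbers stands in for the real track metric here:</p>

```python
# Fill a symmetric pairwise distance matrix and hand it to DBSCAN with
# metric="precomputed". The scalar items and the abs-metric are placeholders
# for real tracks and the real track metric.
import numpy as np
from sklearn.cluster import DBSCAN


def pairwise_matrix(items, metric):
    n = len(items)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):  # the metric is symmetric
            g[i, j] = g[j, i] = metric(items[i], items[j])
    return g


items = [0.0, 1.0, 1.5, 50.0]
g = pairwise_matrix(items, lambda a, b: abs(a - b))
labels = DBSCAN(eps=2.0, min_samples=2, metric="precomputed").fit_predict(g)
```

<p>With the matrix precomputed, <code>eps</code> is expressed directly in the units of the metric, i.e. in meters for the track metric.</p>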
<p>The metric is implemented using NumPy arrays and takes around 31 ms per call, so with the current 224 activities it would take something like 20 minutes to compute the full matrix. That seems wasteful, as we already have the clusters from the start and end points. We can restrict the comparison to the tracks within one start-end cluster and do a sub-cluster detection there. The problem is that within each cluster there are still a lot of activities; the cluster around my home contains more than half of them, which is not really a surprise.</p>
<p>I have tried to thin out the points before feeding them into the metric. This does not work, because then the average distance between the points increases and the metric value becomes less meaningful. So perhaps there is no way to save processing power there. Computing the distance between all tracks in a cluster is an $\mathrm O(n^2)$ operation and quite lengthy.</p>
<h2 id="sub-clustered-tracks">Sub-clustered tracks</h2>
<p>Using the above procedure I have created sub-clusters from each start-end cluster using an $\epsilon = 5.0\,\mathrm{m}$. The clusters look somewhat sensible.</p>
<p>For instance, there is a round course that I do through the woods. I have done it slightly differently twice, and the algorithm has identified both variants.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/cluster1.jpg"></p>
<p>There is another round that I do to the Rhine river; the algorithm spotted two instances of that.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/cluster2.jpg"></p>
<p>Then I have another one through the fields, which apparently I have recorded three times.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/cluster3.jpg"></p>
<p>The largest number of instances is in the walk around the field near the air strip in Hangelar. There one can see that they are mostly similar, though one time we did a detour along the main road.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/cluster4.jpg"></p>
<p>And the walk through the field right next to my door.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/cluster5.jpg"></p>
<p>I find that it works reasonably well, depending on the intention.</p>
<h2 id="refining-the-metric">Refining the metric</h2>
<p>One could however argue that the metric is not fine-grained enough. One can see that routes are often put into the same cluster if they have some of the path in common. These shared segments bring down the average distance and allow for larger deviations in other parts. The result is some loose notion of similarity between routes: there is always enough in common that the distance stays below the threshold.</p>
<p>But this is not what I had in mind. I actually would like to have each distinct shape in its own category. So instead of taking the mean distance between points, I would rather go for the maximum, which captures the largest deviation along the route. The problem is that sometimes the GPS produces one or two outlier points. Therefore it would be better not to take the maximum but, say, the third largest distance, and use that as the metric. That would be safe against outliers; alternatively one could use some particular quantile. Taking the sum of distances will not work due to the noise and systematic shifts: these would just build up as the routes get longer, giving a rather meaningless number. Using <code>np.partition</code> I can get the second largest element from the row minima and from the column minima, and use their sum as the metric.</p>
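<p>As a sketch, the outlier-safe combination of the row and column minima could look like this, where <code>d</code> is the distance matrix $D_{ij}$ between two tracks:</p>

```python
# Outlier-safe track metric: instead of averaging the per-point minimum
# distances, take the second largest of them, so that one or two GPS
# outlier points cannot dominate the result.
import numpy as np


def second_largest(values):
    # np.partition places the k-th smallest element at index k.
    k = len(values) - 2
    return np.partition(values, k)[k]


def robust_track_distance(d):
    row_mins = d.min(axis=1)  # distance of each point of track 1 to track 2
    col_mins = d.min(axis=0)  # and vice versa
    return second_largest(row_mins) + second_largest(col_mins)
```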
<p>Using this new definition it can distinguish between routes where I have taken a detour and the ones where I have not.</p>
<p>So of the walks to the air field in Hangelar, only these are now within the same cluster. The one with the little detour is gone, just as I wanted.</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/cluster_new1.png"></p>
<h2 id="conclusion">Conclusion</h2>
<p>It seems possible to automatically find clusters of whole tracks in this way. The performance of the metric that I have defined is not so awesome, and the $\mathrm O(n^2)$ runtime is not perfect either. Something of order $\mathrm O(n \log n)$ would be much preferred, but then I would have to find a way to avoid comparing every track in a cluster with every other one.</p>
<p>Using the clusters and sub-clusters determined here, I could do more analysis or simply create a list of links. The clusters could be given names to make them more meaningful.</p>
<p>I just realized that Strava actually offers this as a paid feature. Apparently there is some interest in it!</p>
<p><img alt="" src="https://martin-ueding.de/posts/clustering-recorded-routes/strava_ad.png"></p>
<p>I might embellish this in the future. This has been a nice project to try out clustering algorithms.</p></div>EnglishGPXMachine LearningPythonSciKit-LearnStravahttps://martin-ueding.de/posts/clustering-recorded-routes/Sat, 25 Jul 2020 22:00:00 GMT
- Number Sequence Questions Tried with Deep Learninghttps://martin-ueding.de/posts/number-sequence-tests/Martin Ueding<div><p>As part of IQ tests there are these horrible number sequence tests. I hate them with a passion because they are mathematically ill-defined problems. A super simple one would be to take 1, 3, 5, 7, 9 and ask for the next number. One could find this very easy and say that this sequence is the odd numbers and that therefore the next number should be 11. But searching the <a href="https://oeis.org/search?q=1+3+5+7+9&sort=&language=english&go=Search">On-Line Encyclopedia of Integer Sequences</a> (OEIS) for that exact sequence gives 521 different results! Here are the first ten of them:</p>
<table>
<thead>
<tr>
<th>Sequence</th>
<th>Prediction</th>
</tr>
</thead>
<tbody>
<tr>
<td>The odd numbers: $a(n) = 2n + 1$.</td>
<td>11</td>
</tr>
<tr>
<td>Binary palindromes: numbers whose binary expansion is palindromic.</td>
<td>15</td>
</tr>
<tr>
<td>Josephus problem: $a(2n) = 2a(n)-1, a(2n+1) = 2a(n)+1$.</td>
<td>11</td>
</tr>
<tr>
<td>Numerators in canonical bijection from positive integers to positive rationals ≤ 1</td>
<td>11</td>
</tr>
<tr>
<td>a(n) = largest base-2 palindrome m ≤ 2n+1 such that every base-2 digit of m is ≤ the corresponding digit of 2n+1; m is written in base 10.</td>
<td>9</td>
</tr>
<tr>
<td>Fractalization of (1 + floor(n/2))</td>
<td>8 or larger</td>
</tr>
<tr>
<td>Self numbers or Colombian numbers (numbers that are not of the form m + sum of digits of m for any m)</td>
<td>20</td>
</tr>
<tr>
<td>Numbers that are palindromic in bases 2 and 10.</td>
<td>33</td>
</tr>
<tr>
<td>Numbers that contain odd digits only.</td>
<td>11</td>
</tr>
<tr>
<td>Number of n-th generation triangles in the tiling of the hyperbolic plane by triangles with angles pi/2, pi/3, …</td>
<td>12</td>
</tr>
</tbody>
</table>
<p>So there must be an additional hidden constraint in the problem statement. Somehow they want the person to find the <em>simplest</em> sequence that explains the series and then use that to predict the next number. But nobody ever defined what “simple” means in this context. If one had a formal definition of the allowed sequence patterns, then these problems would be solvable. As they stand, I deem these problems utterly pointless.</p>
<p>Since I am exploring machine learning with Keras, I wondered whether one could solve this class of problems using these techniques. First I would have to acquire a bunch of these sequence patterns, then generate a bunch of training data and eventually try to train different networks with it. Finally I'd evaluate how well it performs.</p>
<!-- END_TEASER -->
<p>Using a web search I came up with a bunch of different websites that offer these sequence tests. The first one is <a href="https://www.fibonicci.com/numerical-reasoning/number-sequences-test/">fibonicci.com</a>, where they also have this very insightful statement:</p>
<blockquote>
<p>Lastly the hard test contains 21 difficult questions and beware they are known as some of the hardest number sequences on the internet. Can you solve them all? Good luck! — <a href="https://www.fibonicci.com/numerical-reasoning/number-sequences-test/">fibonicci.com</a></p>
</blockquote>
<p>To me this already shows that this is bullshit. What is the metric for “hard” here? We could just take any sequence from OEIS which is super hard to compute and make that a question in an IQ test. I presume that it would not take much to select a series which does not really make sense.</p>
<p>Let me directly make up the hardest sequence ever: 1, 1, 1, 1, 1. What is the next number? Well, it could be <a href="https://oeis.org/A000012">just 1</a>, but <a href="https://oeis.org/A027907">also 2</a> or <a href="https://oeis.org/A028234">even 3</a>, depending on which rule you take. I would offer 1, 2 and 3 as possible answers. Which one would you pick? If I just picked one of these sequences (say the one continuing with 2) and said that 1 and 3 were wrong, this would be called arbitrary. But to me, all these questions are nothing else.</p>
<h2 id="sampling-possible-sequences">Sampling possible sequences</h2>
<h3 id="easy-category">Easy category</h3>
<p>This website gives a few examples in the “easy” category. I have marked the next number in parentheses.</p>
<ul>
<li>2 4 9 11 16</li>
<li>30 28 25 21 16</li>
<li>-972 324 -108 36 -12</li>
<li>16 22 34 52 76</li>
<li>123 135 148 160 173</li>
<li>0.3 0.5 0.8 1.2 1.7</li>
<li>4 5 7 11 19</li>
<li>1 2 10 20 100</li>
</ul>
<p>Let's take a look at these and try some systematics. What are the differences between the numbers?</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">4</span><span class="p">,</span> <span class="m">9</span><span class="p">,</span> <span class="m">11</span><span class="p">,</span> <span class="m">16</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">2</span> <span class="m">5</span> <span class="m">2</span> <span class="m">5</span>
</code></pre>
<p>It seems that the differences alternate between 2 and 5, so that is easy. The next difference will be a 2, so we just have to add 2 and the result is 18. The one after that will be five larger, so 23. We can also find this sequence on <a href="https://oeis.org/A047348">OEIS</a> as “Numbers that are congruent to {2, 4} mod 7”. Using this information we can also create the series ourselves:</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">s</span> <span class="o"><-</span> <span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">100</span><span class="p">)</span>
<span class="o">></span> <span class="n">s</span><span class="p">[</span><span class="n">s</span> <span class="o">%%</span> <span class="m">7</span> <span class="o">%in%</span> <span class="nf">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">4</span><span class="p">)]</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">2</span> <span class="m">4</span> <span class="m">9</span> <span class="m">11</span> <span class="m">16</span> <span class="m">18</span> <span class="m">23</span> <span class="m">25</span> <span class="m">30</span> <span class="m">32</span> <span class="m">37</span> <span class="m">39</span> <span class="m">44</span> <span class="m">46</span> <span class="m">51</span> <span class="m">53</span> <span class="m">58</span> <span class="m">60</span> <span class="m">65</span>
<span class="p">[</span><span class="m">20</span><span class="p">]</span> <span class="m">67</span> <span class="m">72</span> <span class="m">74</span> <span class="m">79</span> <span class="m">81</span> <span class="m">86</span> <span class="m">88</span> <span class="m">93</span> <span class="m">95</span> <span class="m">100</span>
</code></pre>
<p>Then we take a look at the next one, also taking the differences again:</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">30</span><span class="p">,</span> <span class="m">28</span><span class="p">,</span> <span class="m">25</span><span class="p">,</span> <span class="m">21</span><span class="p">,</span> <span class="m">16</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">-2</span> <span class="m">-3</span> <span class="m">-4</span> <span class="m">-5</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">-1</span> <span class="m">-1</span> <span class="m">-1</span>
</code></pre>
<p>So the differences are decreasing integers; the next difference is $-6$, and the answer is $16 - 6 = 10$.</p>
<p>OEIS does not find it, but suggests that the sequence is
$$ a_n = − \frac12 n^2 − \frac12 n + 31 \,. $$
As the differences linearly decrease, it is not surprising that the resulting sequence follows a quadratic prescription.</p>
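<p>That suggestion is easy to cross-check in Python: fitting a degree-2 polynomial through the five values reproduces the suggested coefficients exactly and predicts the next element.</p>

```python
# Fit a quadratic to the sequence 30, 28, 25, 21, 16 and predict a(6).
import numpy as np

n = np.arange(1, 6)
x = np.array([30, 28, 25, 21, 16])
coeffs = np.polyfit(n, x, deg=2)    # close to (-1/2, -1/2, 31)
prediction = np.polyval(coeffs, 6)  # the next element, 10
```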
<p>And on to the next one. First we compute the differences.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">-972</span><span class="p">,</span> <span class="m">324</span><span class="p">,</span> <span class="m">-108</span><span class="p">,</span> <span class="m">36</span><span class="p">,</span> <span class="m">-12</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">1296</span> <span class="m">-432</span> <span class="m">144</span> <span class="m">-48</span>
</code></pre>
<p>That does not seem to help so much as there is a sign flip. So perhaps we should take the differences of the absolute values instead.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">-648</span> <span class="m">-216</span> <span class="m">-72</span> <span class="m">-24</span>
</code></pre>
<p>Nope, that does not help either. Let's try to take the ratio of successive items.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o">/</span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">lead</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">-3</span> <span class="m">-3</span> <span class="m">-3</span> <span class="m">-3</span> <span class="kc">NA</span>
</code></pre>
<p>Yes, so we have to divide by $-3$ to get to the next item. Therefore 4 will be the next one.</p>
<p>For the next one, we need to take the differences of the differences in order to get something useful:</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">16</span><span class="p">,</span> <span class="m">22</span><span class="p">,</span> <span class="m">34</span><span class="p">,</span> <span class="m">52</span><span class="p">,</span> <span class="m">76</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">6</span> <span class="m">12</span> <span class="m">18</span> <span class="m">24</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">6</span> <span class="m">6</span> <span class="m">6</span>
</code></pre>
<p>The last difference is 24, so the next difference will be 30. And therefore the next number will be 106. OEIS unsurprisingly gives us the prescription $a_n = 3 n^2 − 3 n + 16$ where we can see that the second derivative is a constant 6.</p>
<p>The next in the list is a really boring one, with the differences alternating between 12 and 13:</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">123</span><span class="p">,</span> <span class="m">135</span><span class="p">,</span> <span class="m">148</span><span class="p">,</span> <span class="m">160</span><span class="p">,</span> <span class="m">173</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">12</span> <span class="m">13</span> <span class="m">12</span> <span class="m">13</span>
</code></pre>
<p>And then we have another one where the second derivative is constant:</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">0.3</span><span class="p">,</span> <span class="m">0.5</span><span class="p">,</span> <span class="m">0.8</span><span class="p">,</span> <span class="m">1.2</span><span class="p">,</span> <span class="m">1.7</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">0.2</span> <span class="m">0.3</span> <span class="m">0.4</span> <span class="m">0.5</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">0.1</span> <span class="m">0.1</span> <span class="m">0.1</span>
</code></pre>
<p>Finally we are going to see a new pattern in the next one, it seems:</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">7</span><span class="p">,</span> <span class="m">11</span><span class="p">,</span> <span class="m">19</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">1</span> <span class="m">2</span> <span class="m">4</span> <span class="m">8</span>
</code></pre>
<p>It is possible that the derivative is just $2^n$, such that the next difference is 16 and therefore we have $19 + 16 = 35$. This is a new pattern that we haven't seen before.</p>
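<p>If the differences really are powers of two, a closed form consistent with the given elements would be $a_n = 3 + 2^{n-1}$ (my guess, not given in the test):</p>

```python
import numpy as np

# Hypothesis: a_n = 3 + 2^(n-1), so the differences are powers of two.
n = np.arange(1, 7)
a = 3 + 2**(n - 1)
print(a)           # [ 4  5  7 11 19 35]
print(np.diff(a))  # [ 1  2  4  8 16]
```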
<p>And the next one is also a new pattern.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">10</span><span class="p">,</span> <span class="m">20</span><span class="p">,</span> <span class="m">100</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">1</span> <span class="m">8</span> <span class="m">10</span> <span class="m">80</span>
</code></pre>
<p>Here we can either look at the numbers themselves or at the derivative. The numbers appear to be just the digits 1 and 2 with additional zeros appended. We could also interpret this multiplicatively: the ratio from one element to the next alternates between 2 and 5.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">lead</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">x</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">2</span> <span class="m">5</span> <span class="m">2</span> <span class="m">5</span> <span class="kc">NA</span>
</code></pre>
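<p>Under the multiplicative reading, the sequence can be generated by alternating the two factors:</p>

```python
# Generate the sequence by alternately multiplying by 2 and by 5.
x = [1]
for i in range(5):
    x.append(x[-1] * (2 if i % 2 == 0 else 5))
print(x)  # [1, 2, 10, 20, 100, 200]
```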
<h3 id="medium-category">Medium category</h3>
<p>We could go on and also take a look at the harder problems. In the “medium” category we find more series:</p>
<ul>
<li>-2 5 -4 3 -6</li>
<li>1 4 9 16 25</li>
<li>75 15 25 5 15</li>
<li>1 2 6 24 120</li>
<li>183 305 527 749 961</li>
<li>16 22 34 58 106</li>
<li>17 40 61 80 97</li>
</ul>
<p>For the first one I would think that there are two series zipped together. Both decrease by 2 in each step, the first was started at $-2$ and the second at 5. The next number should be a 1, then.</p>
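<p>The de-interleaving is easy to express with slicing:</p>

```python
# Split the sequence into the two interleaved sub-series.
x = [-2, 5, -4, 3, -6]
first, second = x[0::2], x[1::2]
print(first)           # [-2, -4, -6], steps of -2
print(second)          # [5, 3], steps of -2
print(second[-1] - 2)  # next element: 1
```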
<p>The second are just the square numbers, $n^2$.</p>
<p>In the third one I am pretty lost. The first and second derivative do not give much information here.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">75</span><span class="p">,</span> <span class="m">15</span><span class="p">,</span> <span class="m">25</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="m">15</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">-60</span> <span class="m">10</span> <span class="m">-20</span> <span class="m">10</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">70</span> <span class="m">-30</span> <span class="m">30</span>
</code></pre>
<p>Also taking the ratios of successive elements does not give much insight at first.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">lead</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">x</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">0.200000</span> <span class="m">1.666667</span> <span class="m">0.200000</span> <span class="m">3.000000</span> <span class="kc">NA</span>
</code></pre>
<p>Perhaps we have two partial series here that are independent of each other?</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">odd</span> <span class="o"><-</span> <span class="n">x</span><span class="p">[</span><span class="nf">seq</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="n">by</span> <span class="o">=</span> <span class="m">2</span><span class="p">)]</span>
<span class="o">></span> <span class="n">odd</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">75</span> <span class="m">25</span> <span class="m">15</span>
<span class="o">></span> <span class="n">even</span> <span class="o"><-</span> <span class="n">x</span><span class="p">[</span><span class="nf">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span> <span class="m">5</span><span class="p">,</span> <span class="n">by</span> <span class="o">=</span> <span class="m">2</span><span class="p">)]</span>
<span class="o">></span> <span class="n">even</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">15</span> <span class="m">5</span>
</code></pre>
<p>This does not look particularly sensible. The possible answers are 7, 3, 5 and 2. Since all the elements so far end in a 5, one would guess that the next one does too. But that is not the case: 3 is the correct answer.</p>
<p>Just plotting the data with the additional point gives a pattern that looks like this:</p>
<p><img alt="" src="https://martin-ueding.de/posts/number-sequence-tests/Bildschirmfoto_20200608-14:13:21-e7b-Auswahl.png"></p>
<p>We can see that there are jumps up and down. The jumps up always have the same height (10 more), while the jumps down get smaller. Combining this information, the sequence is divided by 5 on the even steps and increased by 10 on the odd steps.</p>
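<p>A sketch of this alternating rule, reproducing the sequence and the answer 3:</p>

```python
# Alternate between dividing by 5 and adding 10, starting from 75.
x = [75]
for i in range(5):
    x.append(x[-1] // 5 if i % 2 == 0 else x[-1] + 10)
print(x)  # [75, 15, 25, 5, 15, 3]
```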
<p>The sequence 1, 2, 6, 24, 120 consists of the factorials, with 720 as the next element. We can see that by taking the ratio of successive elements and then the derivative of that.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span> <span class="m">2</span><span class="p">,</span> <span class="m">6</span><span class="p">,</span> <span class="m">24</span><span class="p">,</span> <span class="m">120</span><span class="p">)</span>
<span class="o">></span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">lead</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">x</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">2</span> <span class="m">3</span> <span class="m">4</span> <span class="m">5</span> <span class="kc">NA</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">dplyr</span><span class="o">::</span><span class="nf">lead</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">1</span> <span class="m">1</span> <span class="m">1</span> <span class="kc">NA</span>
</code></pre>
<p>The next one has a rather strange derivative, and the ratio of successive elements does not help much either.</p>
<pre class="code literal-block"><span></span><code><span class="o">></span> <span class="n">x</span> <span class="o"><-</span> <span class="nf">c</span><span class="p">(</span><span class="m">183</span><span class="p">,</span> <span class="m">305</span><span class="p">,</span> <span class="m">527</span><span class="p">,</span> <span class="m">749</span><span class="p">,</span> <span class="m">961</span><span class="p">)</span>
<span class="o">></span> <span class="nf">diff</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">122</span> <span class="m">222</span> <span class="m">222</span> <span class="m">212</span>
<span class="o">></span> <span class="n">dplyr</span><span class="o">::</span><span class="nf">lead</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">/</span> <span class="n">x</span>
<span class="p">[</span><span class="m">1</span><span class="p">]</span> <span class="m">1.666667</span> <span class="m">1.727869</span> <span class="m">1.421252</span> <span class="m">1.283044</span> <span class="kc">NA</span>
</code></pre>
<p>Is this a play of digits in the differences? Actually it is a play of digits in the numbers themselves. The first digits are increasing odd numbers, the second digits are increasing even numbers, and the last digits are again increasing odd numbers, all wrapping around cyclically. Following this approach we just increase all the digits cyclically by two and end up with 183 again. That is the correct answer, though I have to admit that I cheated and looked at the answer before figuring out the pattern.</p>
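<p>Equivalently, every digit is increased by two modulo 10, which can be sketched as:</p>

```python
def step(n):
    # Increase every digit by 2, wrapping around modulo 10.
    return int(''.join(str((int(d) + 2) % 10) for d in str(n)))

x = [183]
for _ in range(5):
    x.append(step(x[-1]))
print(x)  # [183, 305, 527, 749, 961, 183]
```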
<h3 id="identified-patterns">Identified patterns</h3>
<p>From the sequences sampled in the above section, we can extract the following list of patterns that occur.</p>
<ul>
<li>Differences alternate between two numbers ({2, 5} or {12, 13})</li>
<li>Difference decreases by one</li>
<li>Divide by $-3$</li>
<li>Second derivative constant (-1 or 6)</li>
<li>Derivative is $2^n$</li>
<li>Multiply by alternatingly 2 and 5</li>
<li>Alternatingly divide by 5 and increase by 10</li>
<li>Factorial series, multiply by increasing numbers</li>
<li>Treating the digits as separate series with modular arithmetic</li>
</ul>
<p>The website <a href="https://www.jobtestprep.co.uk/number-series-test?idev_username=Fiboni-JTP">jobtestprep.co.uk</a> has an article about these series and gives patterns that are just like the ones that we have identified from the examples.</p>
<p>There likely are even more patterns that one could identify in other examples. As the patterns are mostly arbitrary, whatever the test maker thought to be intuitive and obvious, we can just take a few of them and see whether we can get a machine learning algorithm to work on them.</p>
<h2 id="machine-learning-approaches">Machine learning approaches</h2>
<p>It would be rather straightforward to write a program for this using conventional methods: take the first and second derivative, look for even-odd patterns, and take ratios and inverse ratios to find integer factors. This alone would already solve most of the examples. One would need to come up with a decision tree to look for the patterns, but that would be boring.</p>
<p>Fundamentally I see two different approaches to this problem.</p>
<ol>
<li>
<p>One could try to predict the next number directly by letting the system learn to map the first five elements to the sixth one. In testing one would just give it a bunch of five-element sequences and check whether it produces the desired answer.</p>
</li>
<li>
<p>Categorize the sequences according to the patterns that may occur. This turns it into a finite classification problem. In order to predict the next element one would also have to implement the rules that produce it. Having a separate category for <em>increase by one</em> and <em>increase by two</em> would quickly make the space of categories uncontrollably large. Instead one should only have the category, and the prescription would still need to compute the derivative or ratio.</p>
</li>
</ol>
<p>Next one needs to figure out how the data should be encoded. One option is to keep the data as is, scaled down to floating point numbers in the interval $[-1, 1]$. Alternatively one could use a one-hot encoding for the numbers, which needs a lot of dimensions to work properly. At first I will restrict the series to a range of roughly 0 to 100 such that the network does not have too hard a time. I want to start with the data as is.</p>
<h2 id="raw-data-plain-linear">Raw data, plain linear</h2>
<p>To start with, I will just use the simplest type of series that I can imagine. </p>
<pre class="code literal-block"><span></span><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="n">sample_count</span> <span class="o">=</span> <span class="mi">10000</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">default_rng</span><span class="p">()</span>
<span class="n">increase</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span>
<span class="n">starts</span> <span class="o">=</span> <span class="n">rng</span><span class="o">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">)</span>
<span class="n">slopes</span> <span class="o">=</span> <span class="n">rng</span><span class="o">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">)</span>
<span class="n">sequences</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">starts</span><span class="p">)</span><span class="o">.</span><span class="n">T</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">slopes</span><span class="p">)</span><span class="o">.</span><span class="n">T</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">increase</span><span class="p">)</span>
<span class="n">normalization</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">max</span><span class="p">(</span><span class="n">sequences</span><span class="p">)</span>
<span class="n">sequences</span> <span class="o">=</span> <span class="n">sequences</span> <span class="o">/</span> <span class="n">normalization</span>
</code></pre>
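<p>The later call to <code>model.fit</code> uses arrays <code>data</code> and <code>target</code> that are not defined in the snippets; a plausible split (my assumption: the first five normalized elements as input, the sixth as regression target) would be:</p>

```python
import numpy as np

# Regenerate the normalized linear sequences as above.
sample_count = 10000
rng = np.random.default_rng()
increase = np.arange(0, 6)
starts = rng.integers(1, 10, sample_count)
slopes = rng.integers(1, 10, sample_count)
sequences = starts[:, None] + slopes[:, None] * increase[None, :]
sequences = sequences / np.max(sequences)

# Assumed split: first five elements are the input, the sixth is the target.
data = sequences[:, :5]
target = sequences[:, 5]
print(data.shape, target.shape)  # (10000, 5) (10000,)
```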
<h3 id="dense-network">Dense network</h3>
<pre class="code literal-block"><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
<span class="n">model</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s1">'rmsprop'</span><span class="p">,</span>
              <span class="n">loss</span><span class="o">=</span><span class="s1">'mean_absolute_error'</span><span class="p">,</span>
              <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'mean_squared_error'</span><span class="p">])</span>
</code></pre>
<pre class="code literal-block"><span></span><code><span class="gh">Layer (type)                 Output Shape              Param #   </span>
<span class="gh">=================================================================</span>
<span class="gh">dense_22 (Dense)             (None, 32)                192       </span>
<span class="gh">_________________________________________________________________</span>
<span class="gh">dense_23 (Dense)             (None, 1)                 33        </span>
<span class="gh">=================================================================</span>
Total params: 225
Trainable params: 225
</code></pre>
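<p>The parameter counts follow from weights plus biases per layer:</p>

```python
# Dense layer parameters: inputs * units + units (biases).
dense_1 = 5 * 32 + 32   # 192
dense_2 = 32 * 1 + 1    # 33
print(dense_1, dense_2, dense_1 + dense_2)  # 192 33 225
```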
<pre class="code literal-block"><span></span><code><span class="n">history</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span>
    <span class="n">epochs</span><span class="o">=</span><span class="mi">30</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">20</span><span class="p">,</span>
    <span class="n">validation_split</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
</code></pre>
<p><img alt="" src="https://martin-ueding.de/posts/number-sequence-tests/result-1591628532.svg"></p>
<p>The predictions that come out of the network need to be rounded to the nearest integer. I check how many of the deviations have an absolute value greater than 0.5, which would mean a wrong prediction. That rate is 4.3 %, so not too bad for a simple dense network with 32 features, but also not that good given that the problem could be solved so easily with conventional methods.</p>
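<p>The error-rate computation itself is not shown; a sketch under the assumption that deviations are measured in the original (un-normalized) units:</p>

```python
import numpy as np

# Assumed definition: a prediction is wrong when the rounded value misses
# the true next element, i.e. the deviation in original units exceeds 0.5.
def error_rate(predictions, target, normalization):
    deviations = (predictions - target) * normalization
    return np.mean(np.abs(deviations) > 0.5)

# Toy example with a made-up normalization of 100: one of four
# predictions is off by more than half an integer step.
pred = np.array([0.10, 0.204, 0.296, 0.46])
true = np.array([0.10, 0.200, 0.300, 0.40])
print(error_rate(pred, true, normalization=100))  # 0.25
```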
<p>Making that 512 features does not help at all. This is the loss for that case:</p>
<p><img alt="" src="https://martin-ueding.de/posts/number-sequence-tests/result-1591628931.svg"></p>
<p>And the error rate is 76.2 %, so that clearly does not help at all.</p>
<h3 id="convolution">Convolution</h3>
<p>Next I want to try a convolutional layer. In this example at most two successive elements are connected, therefore a convolution should be sufficient; something like a GRU or LSTM layer seems like overkill. And I know that there will be one pattern which is present globally, so I want to use a global average pooling layer. Since there are only 9 different slopes in the data, I should be able to get away with 16 different filters in the convolutional layer.</p>
<pre class="code literal-block"><span></span><code><span class="n">model</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Conv1D</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,),</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">GlobalAveragePooling1D</span><span class="p">())</span>
<span class="n">model</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">1</span><span class="p">))</span>
</code></pre>
<p>This model seems to perform very well on the data, with only 1.5 % error rate. The mean absolute error seems to be rather stable at 0.5, which is just the acceptable threshold.</p>
<p><img alt="" src="https://martin-ueding.de/posts/number-sequence-tests/result-1591629441.svg"></p>
<h2 id="raw-data-quadratic-forms">Raw data, quadratic forms</h2>
<p>I now start to add some quadratic forms to the mix using the following generating code:</p>
<pre class="code literal-block"><span></span><code><span class="n">starts</span> <span class="o">=</span> <span class="n">rng</span><span class="o">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">)</span>
<span class="n">slopes</span> <span class="o">=</span> <span class="n">rng</span><span class="o">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">)</span>
<span class="n">curvatures</span> <span class="o">=</span> <span class="n">rng</span><span class="o">.</span><span class="n">integers</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="n">sample_count</span><span class="p">)</span>
<span class="n">sequences</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">starts</span><span class="p">)</span><span class="o">.</span><span class="n">T</span> \
    <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">slopes</span><span class="p">)</span><span class="o">.</span><span class="n">T</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">increase</span><span class="p">)</span> \
    <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">curvatures</span><span class="p">)</span><span class="o">.</span><span class="n">T</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">increase</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span>
</code></pre>
<p>This way the system needs to learn a curvature. Sadly, this seems to be hard and the error rate is at 89.4 % with a high loss:</p>
<p><img alt="" src="https://martin-ueding.de/posts/number-sequence-tests/result-1591629988.svg"></p>
<p>I would have thought that a convolution could learn to figure out a slope: with the kernel $a_n - a_{n-1}$ it can measure the slope, and with a kernel like $a_{n-1} - 2 a_n + a_{n+1}$ it can measure the curvature. Perhaps the problem was that I have only allowed 16 filters.</p>
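<p>These kernels can be checked directly on a quadratic sequence; note that <code>np.convolve</code> flips the kernel, which only matters for the asymmetric difference kernel:</p>

```python
import numpy as np

# Quadratic sequence 5 + n + 2 n^2 for n = 0..4.
x = np.array([5, 8, 15, 26, 41])

# Kernel (1, -1) measures the slope a_n - a_{n-1};
# kernel (1, -2, 1) measures the curvature a_{n-1} - 2 a_n + a_{n+1}.
slope = np.convolve(x, [1, -1], mode='valid')
curvature = np.convolve(x, [1, -2, 1], mode='valid')
print(slope)      # [ 3  7 11 15]
print(curvature)  # [4 4 4]
```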
<p>Allowing for 64 filters does not make it much better; this brings us to 83.9 %.</p>
<p>Perhaps convolutions of length 3 are not sufficient, so I try a kernel length of 5, which makes the pooling layer essentially useless. The error rate is still 83.2 %, so that did not help.</p>
<p>So I increase to 512 filters of length 5, perhaps that does the trick. We are now at a 99.0 % error rate. The network seems to overfit, although the validation loss looks reasonably close to the training loss.</p>
<p><img alt="" src="https://martin-ueding.de/posts/number-sequence-tests/result-1591630438.svg"></p>
<p>We have seen this pattern before: too many convolutional filters make the result worse. Perhaps one could improve by training for more epochs, but a simpler model gave better results with fewer resources.</p>
<h2 id="conclusions">Conclusions</h2>
<p>It seems that just applying a convolution to the data is not sufficient to extract all the interesting features. This surprises me a bit because I would have expected that a convolution could effectively take the first and second derivative of the data. The dense layer at the end must then be able to build the sum of the last data point and the first and second derivative to extrapolate the next point. Perhaps the problem is that the convolution does not preserve the last element, and one would have to re-inject the original information into the final stage via a shortcut connection.</p>
<p>Maybe even with those changes the convolution just would not work. One would need to extract features like the first and second derivative, the ratios and other things beforehand. Some of these are non-linear transformations; the network may not be able to learn them just from the data, and one would have to provide some features manually.</p>
<p>It could also be that deep learning is not the right approach here, and since the solution should rather be a decision tree built from more complicated features, one should pursue an avenue like that.</p></div>EnglishKerasMachine Learninghttps://martin-ueding.de/posts/number-sequence-tests/Tue, 09 Jun 2020 22:00:00 GMT
- Fit Range Determination with Machine Learninghttps://martin-ueding.de/posts/fit-range-determination-with-machine-learning/Martin Ueding<div><p>One of the most tedious and error-prone things in my work in Lattice QCD is the manual choice of fit ranges. While reading up on Keras, deep neural networks and machine learning and how experimental the whole field is, I thought about just trying the fit range selection with deep learning.</p>
<p>We have correlation functions $C(t)$ which behave as $\sum_n A_n \exp(-E_n t)$ plus noise. The $E_n$ are the energies of the state $n$, the $A_n$ are the respective amplitudes. We are interested in extracting the smallest of the $E_n$, the ground state energy. We use that for sufficiently large times $t$ the term with the smallest energy dominates the expression. Without loss of generality we say $E_0 < E_1 < \ldots$ and formally write
$$ C(t) \sim A_0 \exp(-E_0 t) \qquad (t \to \infty) \,. $$</p>
<p>By taking the <em>effective mass</em> as defined by
$$ m_\text{eff}(t) = \log\left(\frac{C(t)}{C(t+1)}\right) $$
we get $m_\text{eff}(t) \sim E_0$ in the region of large $t$. There are more subtleties involved (backward-propagating states, thermal states), which we will ignore here. The effective mass is expected to be constant in a region where $t$ is large enough that the higher states have decayed, yet the exponentially decaying signal-to-noise ratio is still sufficiently good. An example of such an effective mass is the following.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/effmass_example.png"></p>
<!-- TEASER_END -->
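<p>A quick numerical sketch of this behaviour with a synthetic two-state correlator (made-up values $E_0 = 0.5$, $E_1 = 1.2$), where the effective mass $\log(C(t)/C(t+1))$ starts above the ground state energy and settles onto it:</p>

```python
import numpy as np

# Synthetic two-state correlator C(t) = A0 exp(-E0 t) + A1 exp(-E1 t).
t = np.arange(0, 20)
E0, E1, A0, A1 = 0.5, 1.2, 1.0, 0.8
C = A0 * np.exp(-E0 * t) + A1 * np.exp(-E1 * t)

# Effective mass log(C(t) / C(t+1)) approaches E0 at large t.
m_eff = np.log(C[:-1] / C[1:])
print(m_eff[0], m_eff[-1])  # starts above E0, ends very close to 0.5
```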
<p>Fitting a constant to the effective mass allows us to extract the ground state energy $E_0$. In the above image one can see such a manually chosen fit range. It starts after the excited states coming from above have decayed and stops before the noise takes over. Such a <em>plateau</em> must have all data points statistically compatible with the fitted value; the fluctuations shall be $\chi^2$ distributed. This is a fancy way of saying that most points should lie within one error bar of the fitted line, some within two error bars, and only very few within three or more.</p>
<p>For my dissertation I have to determine around 500 of these ranges, and it gets boring rather quickly. Especially after every change in the data, this needs to be re-done. So perhaps, after doing a few hundred of them by hand, I could train a neural network to do this work for me? Already at this point I know that even if I find such a solution, it would need a lot of vetting from my peers before it would be considered credible. Therefore I will still need to verify all the fit ranges by hand. Still, I find it an interesting side project.</p>
<p>For this project I again use a <a href="https://jupyter.org/">Jupyter Notebook</a>, which is just as great of a platform for Python as <a href="https://rmarkdown.rstudio.com/">R Markdown</a> is for R. I can recommend it over working with a script file in both languages.</p>
<h2 id="transferring-the-data">Transferring the data</h2>
<p>I have all my analysis data in R. Machine learning with Keras is done in Python. So I have used <a href="http://dirk.eddelbuettel.com/code/rcpp.cnpy.html">RcppCNPy</a> to export my data from R into the NumPy format. There is a limitation that only 1D and 2D data structures can be exported. Also one needs to keep in mind that R has the <a href="https://en.wikipedia.org/wiki/Row-_and_column-major_order">column-major layout</a> that FORTRAN uses whereas NumPy uses the row-major layout of C. I transpose the tensor in R using <code>aperm</code> before storing it with <code>npySave</code>.</p>
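<p>The axis reordering can be cross-checked on the Python side. R's <code>aperm(a, c(3, 2, 1))</code>, for example, corresponds to <code>np.transpose</code> with the axes reversed (the concrete shapes here match the ones described below, but the permutation used in the actual analysis is an assumption):</p>

```python
import numpy as np

# R-side layout: (feature, time, measurement). Reversing the axes gives
# (measurement, time, feature), i.e. the shape (N, 32, 2) used later.
a = np.zeros((2, 32, 142))
b = np.transpose(a, (2, 1, 0))
print(b.shape)  # (142, 32, 2)
```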
<p>From my analysis I have a lot of things available, but the neural network can likely only look at the effective mass. The actual correlator varies on large scales and from what I read the neural networks like data that is somewhat normally distributed. I also need to export the uncertainties of each point. The central values may fluctuate around the plateau within their errors. Just looking at the central values is not enough. I need both as input, although I am not sure how values and errors should be fed into the neural network.</p>
<p>I export the data for a particular ensemble only. This might be a problem for generalization to other ensembles, but then the lattice spacing and pion mass would also need to be input to the neural network; I want to keep it simple. On the cA2.60.32 ensemble we always have a time extent of $T = 64$, such that half of it (the correlator is symmetric and therefore redundant) is 32 slices. The resulting data tensor will be of shape $(N, 32, 2)$ for $N = 142$ measurements: 32 time slices and the two features (value, error).</p>
<p>In R I have the transposed structure with shape $(2, 32, N)$. To make sure that I have the data correctly transferred, I make a plot of the 7th correlator in R:</p>
<pre class="code literal-block"><span></span><code><span class="n">hadron</span><span class="o">::</span><span class="nf">plotwitherror</span><span class="p">(</span>
    <span class="n">x</span> <span class="o">=</span> <span class="m">1</span><span class="o">:</span><span class="nf">length</span><span class="p">(</span><span class="n">all_data</span><span class="p">[</span><span class="m">1</span><span class="p">,</span> <span class="p">,</span> <span class="m">7</span><span class="p">]),</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">all_data</span><span class="p">[</span><span class="m">1</span><span class="p">,</span> <span class="p">,</span> <span class="m">7</span><span class="p">],</span>
    <span class="n">dy</span> <span class="o">=</span> <span class="n">all_data</span><span class="p">[</span><span class="m">2</span><span class="p">,</span> <span class="p">,</span> <span class="m">7</span><span class="p">])</span>
</code></pre>
<p>And then I do the same thing with my NumPy data structure. Keep in mind that R is 1-indexed and Python is 0-indexed.</p>
<pre class="code literal-block"><span></span><code><span class="n">ax</span><span class="o">.</span><span class="n">errorbar</span><span class="p">(</span>
<span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">33</span><span class="p">),</span>
<span class="n">data</span><span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">],</span>
<span class="n">data</span><span class="p">[</span><span class="mi">6</span><span class="p">,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">],</span>
<span class="n">marker</span><span class="o">=</span><span class="s1">'o'</span><span class="p">,</span>
<span class="n">linestyle</span><span class="o">=</span><span class="s1">'none'</span><span class="p">)</span>
</code></pre>
<p>This gives me the same looking plot and I am confident that I have the data just the way that I want it.</p>
<p>The fit ranges (target values) are easier: I have a tensor of shape $(N, 2)$ which contains the beginning and the end of the fit range as integers.</p>
<h2 id="choosing-the-network-model">Choosing the network model</h2>
<p>As the correlator data that I analyze is a time series, there are two options that I already saw covered in the book:</p>
<ol>
<li>
<p>A recurrent neural network (RNN), made with LSTM or GRU layers.</p>
</li>
<li>
<p>A convolutional neural network (convnet), made with convolutional and pooling layers.</p>
</li>
</ol>
<p>I think that we do not really need that much global information; we want to check locally for a plateau. So we will start with a convolutional layer, and perhaps later try the recurrent neural network as well. Luckily Keras is so easy to work with that one can just exchange the building blocks and train the network again.</p>
<h2 id="encoding-the-target-data">Encoding the target data</h2>
<p>Then we need to figure out a way to encode the target data. Just having two integers is likely not going to work very well. If we were to target only a single integer, we would use a one-hot encoding for the numbers, a softmax activation function and categorical crossentropy as the loss function. We have two integers, so perhaps we need a non-sequential network graph to generate two one-hot encoded outputs.</p>
<h3 id="marking-the-plateau">Marking the plateau</h3>
<p>An alternative would be to mark the plateau region with all 1's and everything around it with all 0's. The neural network would then basically output, for every single time slice, the probability of that point belonging to a plateau.</p>
<p>This is easily generated from the given data. One just has to be careful that the <a href="https://github.com/HISKP-LQCD/hadron">hadron</a> fit routine takes <code>tmin</code> and <code>tmax</code> to be inclusive-inclusive whereas Python slicing is inclusive-exclusive. Also hadron's values are 1-based whereas NumPy indices are 0-based, so only <code>tmin</code> needs a shift; on <code>tmax</code> the two corrections cancel.</p>
<pre class="code literal-block"><span></span><code><span class="n">target</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">n_meas</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_meas</span><span class="p">):</span>
<span class="n">tmin</span><span class="p">,</span> <span class="n">tmax</span> <span class="o">=</span> <span class="n">labels</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">target</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">tmin</span><span class="o">-</span><span class="mi">1</span><span class="p">):</span><span class="n">tmax</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre>
<p>This encoding then looks like the following with <code>ax.imshow(target)</code>:</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/Bildschirmfoto_20200531-20:26:56-c5a-Auswahl.png"></p>
<p>The training process also needs a loss function and a metric to judge success. Looking at the <a href="https://www.tensorflow.org/api_docs/python/tf/keras/losses">documentation for the losses</a> we can see that there are a bunch of them. The <em>categorical crossentropy</em> is not applicable here, so we just try the <em>mean absolute error</em>, which is defined as <code>mean(abs(y_true - y_pred))</code>. We therefore get a penalty for every point that is marked as a plateau but should not be, and vice versa.</p>
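<p>As a toy illustration (the numbers are made up): with 0/1 targets and hard 0/1 predictions, the mean absolute error is just the fraction of mislabelled time slices.</p>

```python
import numpy as np

# Made-up plateau masks over six time slices.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])
# Each slice contributes either 0 or 1 to the absolute error, so the mean
# is exactly the fraction of wrong slices: 2 of 6 here.
mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # prints 0.3333333333333333
```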
<p>For the activation I am not sure what to use; we will just go with a sigmoid function to push each output towards either 0 or 1.</p>
<p>In order to measure success in the end I use the <em>false positives</em> and <em>false negatives</em> metrics. This way we can see how many of the $142 \times 32 = 4544$ result elements were computed incorrectly and in which direction the errors are biased.</p>
<p>One problem with this approach certainly is that a fit range needs to be consecutive. Holes in the fit range could be represented with this encoding, but we do not want to allow them.</p>
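<p>One possible fix, sketched here as my own post-processing idea rather than anything from the analysis above, would be to threshold the prediction and keep only the longest consecutive run of 1's:</p>

```python
def longest_run(mask):
    """Return (start, end) of the longest consecutive run of 1's, end exclusive."""
    best = (0, 0)
    start = None
    for i, m in enumerate(list(mask) + [0]):  # trailing 0 flushes the last run
        if m and start is None:
            start = i
        elif not m and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

# A thresholded prediction with a hole: only the longest block survives.
print(longest_run([0, 1, 1, 0, 1, 1, 1, 0]))  # prints (4, 7)
```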
<h3 id="one-hot-encoding-start-and-end">One-hot encoding start and end</h3>
<p>An alternative approach would be to use one-hot encoding for the start and also for the end. The <a href="https://www.tensorflow.org/api_docs/python/tf/keras/activations/softmax">softmax</a> transformation also has an <code>axis=-1</code> default argument, which means that it is applied along just that axis, so we can have both start and end in the same result tensor.</p>
<p>The encoding is straightforward.</p>
<pre class="code literal-block"><span></span><code><span class="n">target</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">n_meas</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">32</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_meas</span><span class="p">):</span>
<span class="n">tmin</span><span class="p">,</span> <span class="n">tmax</span> <span class="o">=</span> <span class="n">labels</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:]</span>
<span class="n">target</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">tmin</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">target</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">tmax</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre>
<p>And the result is just as expected, here is just the first fit range shown:</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/one-hot.png"></p>
<p>For the loss we can use the <em>categorical crossentropy</em>, and the metric will be <em>accuracy</em>. This then tells us how many starts and ends have been determined correctly.</p>
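<p>Decoding such a prediction back into a fit range is then just an argmax along the time axis; this is my own sketch with made-up scores, not code from the analysis:</p>

```python
import numpy as np

# Hypothetical network output of shape (2, 32): one softmax row for the
# start slice and one for the end slice.
pred = np.zeros((2, 32))
pred[0, 9] = 0.9    # highest score for the start
pred[1, 21] = 0.8   # highest score for the end
# Argmax along the time axis, shifted back to hadron's 1-based convention.
tmin = int(np.argmax(pred[0])) + 1
tmax = int(np.argmax(pred[1])) + 1
print(tmin, tmax)  # prints 10 22
```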
<h2 id="dense-approach">Dense approach</h2>
<p>Before trying anything more fancy, we can just go ahead with a simple dense model. Chollet writes that one should start with the simplest model and justify the expense of more complex models by the simple ones not performing well.</p>
<h3 id="using-marked-plateau">Using marked plateau</h3>
<p>The simple dense model that we will try first is defined as such:</p>
<pre class="code literal-block"><span></span><code><span class="n">network0</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'sigmoid'</span><span class="p">))</span>
<span class="n">network0</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s1">'rmsprop'</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s1">'mean_absolute_error'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="n">keras</span><span class="o">.</span><span class="n">metrics</span><span class="o">.</span><span class="n">FalsePositives</span><span class="p">(),</span>
<span class="n">keras</span><span class="o">.</span><span class="n">metrics</span><span class="o">.</span><span class="n">FalseNegatives</span><span class="p">()])</span>
</code></pre>
<p>The model therefore looks like this after compilation:</p>
<pre class="code literal-block"><span></span><code>Layer (type) Output Shape Param #
=================================================================
dense_27 (Dense) (None, 32, 128) 384
_________________________________________________________________
flatten_16 (Flatten) (None, 4096) 0
_________________________________________________________________
dense_28 (Dense) (None, 32) 131104
=================================================================
Total params: 131,488
Trainable params: 131,488
Non-trainable params: 0
</code></pre>
<p>We then train the network with these options:</p>
<pre class="code literal-block"><span></span><code><span class="n">history</span> <span class="o">=</span> <span class="n">network0</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span>
<span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span>
<span class="n">epochs</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span>
<span class="n">batch_size</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
<span class="n">validation_split</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
</code></pre>
<p>The loss and metric look like this over the epochs:</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-loss-1.png"></p>
<p>A mean absolute error of 0.12 means that 12 % of the time slice results are incorrect, as the absolute error per slice is either 0.0 or 1.0. Looking at the rates of false positives and false negatives, we have around 8 % false positives and 4 % false negatives.</p>
<p>The encoding of the plateaus shows us that the network has not really learned that much about the data but rather assumes pretty much the same range for most data sets.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-target-actual-1.png"></p>
<p>Taking the difference between actual and target shows that there are many mistakes and that this model is not that great.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-target-actual2-1.png"></p>
<p>We are not overfitting, so perhaps one should just give it more freedom? No, it does not seem to get any better than the 12 % error rate.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-loss-2.png"></p>
<p>That is the baseline that we would have to beat.</p>
<h3 id="using-one-hot-start-and-end">Using one-hot start and end</h3>
<p>We can also try this model with the other encoding of the target data. I am not quite sure how that works exactly with the activation, because a dense layer outputs a flat vector and cannot have a shape. So I reshape first and apply the softmax activation afterwards.</p>
<pre class="code literal-block"><span></span><code><span class="n">network0</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">))</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Reshape</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">32</span><span class="p">)))</span>
<span class="n">network0</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Activation</span><span class="p">(</span><span class="s1">'softmax'</span><span class="p">))</span>
<span class="n">network0</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s1">'rmsprop'</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s1">'categorical_crossentropy'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
</code></pre>
<p>The results are devastating. It starts to overfit pretty much right from the start:</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-loss-3.png"></p>
<p>When just looking at the start of the fit range, it does not look appealing either.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-target-actual-3.png"></p>
<p>In the difference plot one can see that the start of the fit range is off by a few elements.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network0-target-actual2-3.png"></p>
<p>Given that 80 % of the data has been used for training and that it is overfitting, this does not look too good. One could try to regularize this model to make the overfitting less pronounced, but I fear that this won't make it any better.</p>
<h2 id="convolutional-approach">Convolutional approach</h2>
<p>The convolutional layer can combine information from the local neighborhood. This makes a lot of sense for finding a plateau because it should identify parts where the central values have no trend (linear coefficient) and also no curvature (quadratic coefficient).</p>
<p>We also need to somehow make it use the uncertainty as well as the central values. The central values $m_\text{eff}(t)$ may vary around $\Delta m_\text{eff}(t)$, but not much more. Basically $m_\text{eff}(t) \pm \Delta m_\text{eff}(t)$ is the corridor where it may vary. With a 2D convolutional layer the neural network might be able to pick up this information somehow and massage it into features like “constant within errors” and “upwards/downwards trend within errors”.</p>
<p>The target encoding using 1's in the plateau region and 0's elsewhere seems to make sense here.</p>
<p>The network that I have chosen is the following:</p>
<pre class="code literal-block"><span></span><code><span class="n">network1</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
<span class="n">network1</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))</span>
<span class="c1">#network1.add(keras.layers.Reshape((30, 32)))</span>
<span class="c1">#network1.add(keras.layers.MaxPooling1D((2,)))</span>
<span class="c1">#network1.add(keras.layers.Conv1D(64, (3,), activation='relu'))</span>
<span class="c1">#network1.add(keras.layers.MaxPooling1D((2,)))</span>
<span class="c1">#network1.add(keras.layers.Conv1D(64, (3,), activation='relu'))</span>
<span class="c1">#network1.add(keras.layers.MaxPooling1D((2,)))</span>
<span class="n">network1</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Flatten</span><span class="p">())</span>
<span class="n">network1</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.3</span><span class="p">))</span>
<span class="n">network1</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">))</span>
<span class="n">network1</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'sigmoid'</span><span class="p">))</span>
<span class="nb">print</span><span class="p">(</span><span class="n">network1</span><span class="o">.</span><span class="n">summary</span><span class="p">())</span>
<span class="n">network1</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s1">'rmsprop'</span><span class="p">,</span>
<span class="n">loss</span><span class="o">=</span><span class="s1">'mean_absolute_error'</span><span class="p">,</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="n">keras</span><span class="o">.</span><span class="n">metrics</span><span class="o">.</span><span class="n">FalsePositives</span><span class="p">(),</span>
<span class="n">keras</span><span class="o">.</span><span class="n">metrics</span><span class="o">.</span><span class="n">FalseNegatives</span><span class="p">()])</span>
</code></pre>
<p>It starts with a convolutional layer that uses a 3×2 stencil to pick up the error from the other feature dimension. This way it should be able to build stencils that resolve a trend within errors. As it only performs linear transformations, it likely cannot do a $t$-test, so we might see limitations.</p>
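<p>The kind of feature such a layer could learn on the central values can be illustrated by hand (this is my own toy example, not the trained network): a 3-point stencil that vanishes on a plateau but responds to a linear trend.</p>

```python
import numpy as np

# A hand-made stencil: zero response on a constant, nonzero on a slope.
kernel = np.array([-1.0, 0.0, 1.0])
plateau = np.array([0.5, 0.5, 0.5, 0.5, 0.5])
trend = np.array([0.9, 0.8, 0.7, 0.6, 0.5])  # made-up downward trend
resp_plateau = np.convolve(plateau, kernel, mode='valid')  # all zeros
resp_trend = np.convolve(trend, kernel, mode='valid')      # constant 0.2
```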
<p>I let it go directly into a dense classification network in the hope that this becomes a somewhat diagonal mapping which picks out the applicable stencils that the convolution has learned.</p>
<p>Keras provides the following summary of the model:</p>
<pre class="code literal-block"><span></span><code>Layer (type) Output Shape Param #
=================================================================
conv2d_39 (Conv2D) (None, 30, 1, 64) 448
_________________________________________________________________
flatten_39 (Flatten) (None, 1920) 0
_________________________________________________________________
dropout_3 (Dropout) (None, 1920) 0
_________________________________________________________________
dense_70 (Dense) (None, 128) 245888
_________________________________________________________________
dense_71 (Dense) (None, 32) 4128
=================================================================
Total params: 250,464
Trainable params: 250,464
Non-trainable params: 0
</code></pre>
<p>The results are slightly worse than with the pure dense network.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network1-loss-1.png"></p>
<p>From the target-actual plot I would even say that it shows less individuality for each measurement and treats them mostly the same.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network1-target-actual-1.png"></p>
<p>And in the difference plot it also looks disheartening.</p>
<p><img alt="" src="https://martin-ueding.de/posts/fit-range-determination-with-machine-learning/network1-target-actual2-1.png"></p>
<p>Adding the additional blocks of convolutional and pooling layers (commented out above) does not improve anything. This does not surprise me, as this problem does not really need complicated global features (as in image classification) but rather the local spatial information.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I have now tried a few different models and parameterizations of the data. It feels as if either there is insufficient data to solve this problem in a satisfactory fashion, or I am not experienced enough to find a good neural network for it. I haven't tried the recurrent layers yet; perhaps they won't work that well either.</p>
<p>Last year we discussed this exact problem with machine learning specialists and they deemed it a hard problem. If a simple dense or convolutional network were the answer, they likely would have suggested it. Therefore I am happy to have played around with it, but I am also willing to just leave it at this for now.</p>
<p>Even if this reproduced exactly the fit ranges that I have chosen, it would be unclear how the systematic error from choosing the fit range would be treated. The neural network cannot really explain its reasoning like a human could try to, so one would be stuck with a sort of black box in the analysis chain.</p></div>EnglishKerasMachine LearningPhysicshttps://martin-ueding.de/posts/fit-range-determination-with-machine-learning/Sun, 31 May 2020 22:00:00 GMT
- Simple Captcha with Deep Neural Networkhttps://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/Martin Ueding<div><p>The other day I had to fill in a captcha on some website. Most sites today use
Google's <a href="https://www.google.com/recaptcha/intro/v3.html">reCAPTCHA</a>. It shows
little image tiles and asks you to classify them. They use this to train a
neural network to classify situations for autonomous driving. Writing a
program to solve this captcha would require obscene amounts of data to train a
neural network. And if that data already existed, autonomous cars would be here
already.</p>
<p>The captcha on that website, however, was of the old and simple kind:</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/captcha-example.png"></p>
<p>It is just six numbers (and always six numbers), the concentric circles and
some pepper noise. These kinds of captchas are outdated because one can solve
them with machine learning. And as I am currently working through <a href="https://www.manning.com/books/deep-learning-with-python">“Deep
Learning with Python” by François
Chollet</a> and was
looking for a practice project, this captcha came as inspiration at just the
right moment.</p>
<!-- END_TEASER -->
<h2 id="obtaining-data">Obtaining data</h2>
<p>In order to do machine learning, one needs training data. I have downloaded a
few of the captchas generated by the website. They look like this:</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/captcha-montage.png"></p>
<p>We could just download more and more captchas to use as training data. But then
I would have to solve them all, and I do not want to do that. Instead, I'd
rather build a very similar captcha generator and then just generate my
training, validation and test data from that generator.</p>
<p>In order to generate them, we need to observe how the captcha images are
constructed. One can clearly see that the numbers always have almost the same
$x$-location while the $y$-location is chosen randomly. The numbers also seem
to have a random rotation applied to them.</p>
<p>Magnifying the “8” from the example shows that the digits are drawn with
anti-aliasing, whereas the circles and the noise are not.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/single-zoom.png"></p>
<p>To generate something like this we can use the <a href="https://pillow.readthedocs.io/">Pillow
library</a>. There we got drawing functions that
can be used to generate such a captcha.</p>
<p>First we generate a new image with the appropriate size and gray background.
We also load a font and create a new <code>ImageDraw</code> object so that we
can draw onto the image.</p>
<pre class="code literal-block"><span></span><code><span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="s1">'L'</span><span class="p">,</span> <span class="p">(</span><span class="mi">310</span><span class="p">,</span> <span class="mi">80</span><span class="p">),</span> <span class="mi">221</span><span class="p">)</span>
<span class="n">font</span> <span class="o">=</span> <span class="n">ImageFont</span><span class="o">.</span><span class="n">truetype</span><span class="p">(</span><span class="s1">'Pillow/Tests/fonts/FreeSans.ttf'</span><span class="p">,</span> <span class="mi">27</span><span class="p">)</span>
<span class="n">d</span> <span class="o">=</span> <span class="n">ImageDraw</span><span class="o">.</span><span class="n">Draw</span><span class="p">(</span><span class="n">image</span><span class="p">)</span>
</code></pre>
<p>Next we need to add the black border box that goes around the captcha.</p>
<pre class="code literal-block"><span></span><code><span class="n">d</span><span class="o">.</span><span class="n">line</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">309</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Top</span>
<span class="n">d</span><span class="o">.</span><span class="n">line</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">79</span><span class="p">,</span> <span class="mi">309</span><span class="p">,</span> <span class="mi">79</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Bottom</span>
<span class="n">d</span><span class="o">.</span><span class="n">line</span><span class="p">([</span><span class="mi">309</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">309</span><span class="p">,</span> <span class="mi">79</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Right</span>
<span class="n">d</span><span class="o">.</span><span class="n">line</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">79</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Left</span>
</code></pre>
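<p>As an aside (my suggestion, not the original code): the four <code>line</code> calls could equivalently be replaced by a single rectangle outline along the image edge.</p>

```python
from PIL import Image, ImageDraw

# Same gray canvas as above, with the border drawn in one call instead of four.
image = Image.new('L', (310, 80), 221)
d = ImageDraw.Draw(image)
d.rectangle([0, 0, image.width - 1, image.height - 1], outline=0)
```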
<p>The ellipses are easy to measure in a program like GIMP. They have a spacing of
16 pixels in width and 8 pixels in height. Then we just need to increase the
size of the bounding box and have a bunch of concentric ellipses.</p>
<pre class="code literal-block"><span></span><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">15</span><span class="p">):</span>
    <span class="n">ellipse_width</span> <span class="o">=</span> <span class="mi">16</span>
    <span class="n">ellipse_height</span> <span class="o">=</span> <span class="mi">8</span>
    <span class="n">bbox</span> <span class="o">=</span> <span class="p">(</span><span class="n">image</span><span class="o">.</span><span class="n">width</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">i</span> <span class="o">*</span> <span class="n">ellipse_width</span><span class="p">,</span>
            <span class="n">image</span><span class="o">.</span><span class="n">height</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">i</span> <span class="o">*</span> <span class="n">ellipse_height</span><span class="p">,</span>
            <span class="n">image</span><span class="o">.</span><span class="n">width</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">ellipse_width</span><span class="p">,</span>
            <span class="n">image</span><span class="o">.</span><span class="n">height</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">i</span> <span class="o">*</span> <span class="n">ellipse_height</span><span class="p">)</span>
    <span class="n">d</span><span class="o">.</span><span class="n">ellipse</span><span class="p">(</span><span class="n">bbox</span><span class="p">,</span> <span class="n">outline</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre>
<p>Next are the digits. Looking through many of the generated captchas, one can
see that there never is a “1” or a “7”. There also never is a “9”; it is always
a “6”. That means that the list of digits is severely reduced and we just
randomly pick from that set. The $x$-locations I just read off from one of
the sample captchas. Since I cannot find the exact font, it is not going to be
pixel perfect anyway, so a slight shift in the locations does not hurt that
much.</p>
<p>In order to rotate the digit I draw it onto a temporary image with alpha
channel. Then I paste that onto the actual image.</p>
<pre class="code literal-block"><span></span><code><span class="n">allowed_digits</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">]</span>
<span class="n">digits</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">17</span><span class="p">,</span> <span class="mi">57</span><span class="p">,</span> <span class="mi">115</span><span class="p">,</span> <span class="mi">170</span><span class="p">,</span> <span class="mi">217</span><span class="p">,</span> <span class="mi">260</span><span class="p">]:</span>
    <span class="n">txt</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="s1">'LA'</span><span class="p">,</span> <span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">25</span><span class="p">),</span> <span class="p">(</span><span class="mi">150</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>
    <span class="n">dx</span> <span class="o">=</span> <span class="n">ImageDraw</span><span class="o">.</span><span class="n">Draw</span><span class="p">(</span><span class="n">txt</span><span class="p">)</span>
    <span class="n">digit</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">allowed_digits</span><span class="p">)</span>
    <span class="n">digits</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">digit</span><span class="p">)</span>
    <span class="n">dx</span><span class="o">.</span><span class="n">text</span><span class="p">((</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="nb">str</span><span class="p">(</span><span class="n">digit</span><span class="p">),</span> <span class="n">font</span><span class="o">=</span><span class="n">font</span><span class="p">,</span> <span class="n">fill</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">255</span><span class="p">))</span>
    <span class="n">w</span> <span class="o">=</span> <span class="n">txt</span><span class="o">.</span><span class="n">rotate</span><span class="p">(</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="o">-</span><span class="mi">45</span><span class="p">,</span> <span class="mi">45</span><span class="p">),</span> <span class="n">expand</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span> <span class="mi">45</span><span class="p">)</span>
    <span class="n">image</span><span class="o">.</span><span class="n">paste</span><span class="p">(</span><span class="n">w</span><span class="p">,</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="n">w</span><span class="o">.</span><span class="n">getchannel</span><span class="p">(</span><span class="s1">'A'</span><span class="p">))</span>
</code></pre>
<p>For the pepper noise I just take 500 random points and make them black. I
tried 1000 points, but that was a bit too much.</p>
<pre class="code literal-block"><span></span><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">500</span><span class="p">):</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">309</span><span class="p">)</span>
    <span class="n">y</span> <span class="o">=</span> <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">79</span><span class="p">)</span>
    <span class="n">image</span><span class="o">.</span><span class="n">putpixel</span><span class="p">((</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="mi">0</span><span class="p">)</span>
</code></pre>
<p>In order to know which digits are actually contained in a file, I put them
into the file name. I also append a random number so that the same digit
combination can occur multiple times without filename clashes.</p>
<pre class="code literal-block"><span></span><code><span class="n">filename</span> <span class="o">=</span> <span class="s1">'</span><span class="si">{}</span><span class="s1">-</span><span class="si">{}</span><span class="s1">.png'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
    <span class="s1">''</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">digits</span><span class="p">)),</span>
    <span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1000</span><span class="p">,</span> <span class="mi">9999</span><span class="p">))</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
</code></pre>
<p>I think that the results look rather okay, although it is clear that the font
is not exactly the same.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/montage-mine.png"></p>
<p>Using my script I can just generate as many of these as I want. I will just
start with 2000 samples such that I have 1000 for training, 500 for validation
and 500 for testing.</p>
<h2 id="preprocessing">Preprocessing</h2>
<p>We always have six digits in this captcha. One should always use available
knowledge to make a numeric problem easier to solve. So we can just slice the
captcha images into six slices and do a digit detection on each of them
separately. I think that good slicing $x$-values are 54, 110, 170, 215, 262.</p>
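<p>As a rough sketch (not from the original post), the slicing at these boundary values could look like this, with the last slice running to the right edge of the image:</p>

```python
import numpy as np

def slice_captcha(pixels):
    """Cut a (height, width) captcha array into six digit slices.

    The boundary x-values are the ones read off above; the final
    slice simply extends to the right edge of the image.
    """
    boundaries = [0, 54, 110, 170, 215, 262, pixels.shape[1]]
    return [pixels[:, left:right]
            for left, right in zip(boundaries[:-1], boundaries[1:])]

# A blank stand-in for a 310 × 80 pixel captcha image.
slices = slice_captcha(np.zeros((80, 310)))
```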
<p>At this point I am not sure whether one should train one neural network per
slice or use one neural network for all of them.</p>
<ul>
<li>
<p>One network per slice means that it can fully learn the constant ellipse
background of its slice, so that the background does not matter at all.
Perhaps that would give us better accuracy.</p>
</li>
<li>
<p>If we use one network for all slices, we need something that works with
different input sizes. In this particular case we can just make every slice
51 pixels wide and drop the gaps, so the sizes are exactly the same.</p>
<p>But perhaps a single network with six times the input learns to
classify the digits better. The background would be a bit harder, because
the neural network would not know from which part of the image a slice
came and therefore might have a harder time.</p>
</li>
</ul>
<p>I just don't know in advance, so we will have to find out! The first
option seems a bit easier to implement and test, so I will start with that.</p>
<p>The loading of the image files can be done with the Pillow library or the Keras
wrapper <code>keras.preprocessing.image.load_img</code>. I have to convert the images to
grayscale and normalize them to fall into the interval $[0, 1]$. Then I take
the pixels up to column 54 such that I only look at the first digit with its
background. Additionally one needs to make sure that although they are
grayscale images, the dimensionality is still 3. The labels need to be
one-hot encoded. The following function takes a list of filenames and loads
them all.</p>
<pre class="code literal-block"><span></span><code><span class="k">def</span> <span class="nf">load_files</span><span class="p">(</span><span class="n">paths</span><span class="p">):</span>
    <span class="n">images</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">labels</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">allowed_digits</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">8</span><span class="p">]</span>
    <span class="k">for</span> <span class="n">path</span> <span class="ow">in</span> <span class="n">paths</span><span class="p">:</span>
        <span class="n">image</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">preprocessing</span><span class="o">.</span><span class="n">image</span><span class="o">.</span><span class="n">load_img</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">'data'</span><span class="p">,</span> <span class="n">path</span><span class="p">))</span>
        <span class="n">grayscale</span> <span class="o">=</span> <span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'L'</span><span class="p">)</span>
        <span class="n">array</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">grayscale</span><span class="p">)</span>
        <span class="n">normalized</span> <span class="o">=</span> <span class="n">array</span> <span class="o">/</span> <span class="mi">255</span>
        <span class="n">images</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">atleast_3d</span><span class="p">(</span><span class="n">normalized</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">54</span><span class="p">]))</span>
        <span class="n">label</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">6</span><span class="p">)</span>
        <span class="n">label</span><span class="p">[</span><span class="n">allowed_digits</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="nb">int</span><span class="p">(</span><span class="n">path</span><span class="p">[</span><span class="mi">0</span><span class="p">]))]</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="n">labels</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">label</span><span class="p">)</span>
    <span class="n">all_images</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">images</span><span class="p">)</span>
    <span class="n">all_labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">labels</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">all_images</span><span class="p">,</span> <span class="n">all_labels</span>
</code></pre>
<p>I then use this function to load batches of the image files that I have
generated before. The file list is shuffled because the file names might be
sorted, in which case the first digit of every training image would be a “2”.</p>
<pre class="code literal-block"><span></span><code><span class="n">files</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s1">'data'</span><span class="p">)</span>
<span class="n">random</span><span class="o">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">files</span><span class="p">)</span>
<span class="n">train_images</span><span class="p">,</span> <span class="n">train_labels</span> <span class="o">=</span> <span class="n">load_files</span><span class="p">(</span><span class="n">files</span><span class="p">[:</span><span class="mi">1000</span><span class="p">])</span>
<span class="n">validation_images</span><span class="p">,</span> <span class="n">validation_labels</span> <span class="o">=</span> <span class="n">load_files</span><span class="p">(</span><span class="n">files</span><span class="p">[</span><span class="mi">1000</span><span class="p">:</span><span class="mi">1500</span><span class="p">])</span>
<span class="n">test_images</span><span class="p">,</span> <span class="n">test_labels</span> <span class="o">=</span> <span class="n">load_files</span><span class="p">(</span><span class="n">files</span><span class="p">[</span><span class="mi">1500</span><span class="p">:</span><span class="mi">2000</span><span class="p">])</span>
</code></pre>
<h2 id="fitting-the-first-digit">Fitting the first digit</h2>
<p>With the data in place we can now actually fit the first digit. I have taken a
<em>convolutional neural network</em> (convnet) from the book that detects local
features via a convolution and then uses a pooling layer to gather
information that is a bit less local. With four such blocks the spatial
resolution is scaled down by a factor of $2^4 = 16$. This network has been used
for the binary classification of images of cats and dogs, so I presume that it
is a reasonable starting point for this problem as well.</p>
<p>On top of the convolutional layers I put a dense layer with 512 units that
weighs the extracted features. Then there is a final dense layer whose softmax
activation distinguishes the six different digits that actually appear in the
captchas.</p>
<p>The network generation is done in a function because I will use different
input shapes. The book describes that the input size can also be made variable,
but fixed shapes seem easier for a start.</p>
<pre class="code literal-block"><span></span><code><span class="k">def</span> <span class="nf">make_network</span><span class="p">(</span><span class="n">shape</span><span class="p">):</span>
    <span class="n">network</span> <span class="o">=</span> <span class="n">keras</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">Sequential</span><span class="p">()</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">,</span> <span class="n">input_shape</span><span class="o">=</span><span class="n">shape</span><span class="p">))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">MaxPooling2D</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">MaxPooling2D</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">MaxPooling2D</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Conv2D</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">MaxPooling2D</span><span class="p">((</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Flatten</span><span class="p">())</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'relu'</span><span class="p">))</span>
    <span class="n">network</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">keras</span><span class="o">.</span><span class="n">layers</span><span class="o">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s1">'softmax'</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">network</span>
</code></pre>
<p>We then use this to create a network based on the shape of the first digit
slice.</p>
<pre class="code literal-block"><span></span><code><span class="n">network</span> <span class="o">=</span> <span class="n">make_network</span><span class="p">(</span><span class="n">train_images</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">:])</span>
<span class="n">network</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="n">optimizer</span><span class="o">=</span><span class="s1">'rmsprop'</span><span class="p">,</span>
                <span class="n">loss</span><span class="o">=</span><span class="s1">'categorical_crossentropy'</span><span class="p">,</span>
                <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s1">'accuracy'</span><span class="p">])</span>
</code></pre>
<p>The model is summarized by Keras in the following table:</p>
<table>
<thead>
<tr>
<th>Layer (type)</th>
<th>Output Shape</th>
<th>Param #</th>
</tr>
</thead>
<tbody>
<tr>
<td>conv2d_13 (Conv2D)</td>
<td>(None, 78, 52, 32)</td>
<td>320</td>
</tr>
<tr>
<td>max_pooling2d_13 (MaxPooling)</td>
<td>(None, 39, 26, 32)</td>
<td>0</td>
</tr>
<tr>
<td>conv2d_14 (Conv2D)</td>
<td>(None, 37, 24, 64)</td>
<td>18496</td>
</tr>
<tr>
<td>max_pooling2d_14 (MaxPooling)</td>
<td>(None, 18, 12, 64)</td>
<td>0</td>
</tr>
<tr>
<td>conv2d_15 (Conv2D)</td>
<td>(None, 16, 10, 128)</td>
<td>73856</td>
</tr>
<tr>
<td>max_pooling2d_15 (MaxPooling)</td>
<td>(None, 8, 5, 128)</td>
<td>0</td>
</tr>
<tr>
<td>conv2d_16 (Conv2D)</td>
<td>(None, 6, 3, 128)</td>
<td>147584</td>
</tr>
<tr>
<td>max_pooling2d_16 (MaxPooling)</td>
<td>(None, 3, 1, 128)</td>
<td>0</td>
</tr>
<tr>
<td>flatten_4 (Flatten)</td>
<td>(None, 384)</td>
<td>0</td>
</tr>
<tr>
<td>dense_7 (Dense)</td>
<td>(None, 512)</td>
<td>197120</td>
</tr>
<tr>
<td>dense_8 (Dense)</td>
<td>(None, 6)</td>
<td>3078</td>
</tr>
</tbody>
</table>
<p>And the number of parameters is 440,454 in total.</p>
<p>Then I fit the model using the training data and validate with the
validation data to see whether I run into overfitting.</p>
<pre class="code literal-block"><span></span><code><span class="n">history</span> <span class="o">=</span> <span class="n">network</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span>
    <span class="n">train_images</span><span class="p">,</span> <span class="n">train_labels</span><span class="p">,</span>
    <span class="n">epochs</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span>
    <span class="n">batch_size</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span>
    <span class="n">validation_data</span><span class="o">=</span><span class="p">(</span><span class="n">validation_images</span><span class="p">,</span> <span class="n">validation_labels</span><span class="p">))</span>
</code></pre>
<p>We can visualize the history of the training process with the value of the loss
function as well as the accuracy on both the training and validation data. The
lower gray line in the accuracy plot marks $1/6$ which would be the accuracy of
guessing which of the six digits is correct. The upper one is 1, which is
perfect accuracy.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-3.svg"></p>
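<p>The plots are produced with Matplotlib; the exact script is not shown in this post, but a minimal sketch (assuming the history keys of a recent Keras version) could look like this:</p>

```python
import matplotlib.pyplot as plt

def plot_history(history_dict, filename='history.svg'):
    # `history_dict` is `history.history` as returned by `network.fit`;
    # older Keras versions use 'acc'/'val_acc' instead of 'accuracy'.
    epochs = range(1, len(history_dict['loss']) + 1)
    fig, (ax_loss, ax_acc) = plt.subplots(ncols=2, figsize=(10, 4))
    ax_loss.plot(epochs, history_dict['loss'], label='training')
    ax_loss.plot(epochs, history_dict['val_loss'], label='validation')
    ax_loss.set_xlabel('Epoch')
    ax_loss.set_ylabel('Loss')
    ax_acc.plot(epochs, history_dict['accuracy'], label='training')
    ax_acc.plot(epochs, history_dict['val_accuracy'], label='validation')
    ax_acc.axhline(1 / 6, color='gray')  # accuracy of pure guessing
    ax_acc.axhline(1.0, color='gray')    # perfect accuracy
    ax_acc.set_xlabel('Epoch')
    ax_acc.set_ylabel('Accuracy')
    ax_loss.legend()
    ax_acc.legend()
    fig.savefig(filename)
    plt.close(fig)
```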
<p>After 40 epochs we reach a training accuracy of 1.0000, which certainly smells
like overfitting. But the validation accuracy is also high, so apparently we
are on a good track. To see this better, one should plot the error ratio
on a log scale. Points with perfect accuracy drop out of such a plot, so
focus on the validation curve there.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-4.svg"></p>
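<p>The error ratio is just $1 - \text{accuracy}$ drawn on a logarithmic $y$-axis; a sketch of such a plot (again assuming recent Keras history keys, not the original script) might be:</p>

```python
import matplotlib.pyplot as plt

def plot_error_ratio(history_dict, filename='error_ratio.svg'):
    # Epochs with perfect accuracy have an error ratio of exactly zero
    # and therefore silently drop out of the logarithmic plot.
    epochs = range(1, len(history_dict['accuracy']) + 1)
    fig, ax = plt.subplots()
    ax.semilogy(epochs, [1 - a for a in history_dict['accuracy']],
                label='training')
    ax.semilogy(epochs, [1 - a for a in history_dict['val_accuracy']],
                label='validation')
    ax.set_xlabel('Epoch')
    ax.set_ylabel('Error ratio')
    ax.legend()
    fig.savefig(filename)
    plt.close(fig)
```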
<p>In the end we reach a validation accuracy of 0.9980, so we clearly have a
network which is general enough to solve examples that it has not seen before.
Also on the separate test data we have 0.9980 accuracy.</p>
<p>We need to get six digits right, so the probability of getting the whole
captcha right is $0.9980^6 = 0.9881$. This assumes that the digit recognition
for the other five slices performs similarly, an assumption I am willing to
make. The number is impressively high given that I just took a recipe from the
book and applied it without tuning the network in any way.</p>
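<p>This combined probability is quick to check, assuming independent errors on the six slices:</p>

```python
# Probability of getting all six digit slices right, assuming
# independent errors with the measured per-digit accuracy.
per_digit_accuracy = 0.9980
captcha_accuracy = per_digit_accuracy ** 6
print(round(captcha_accuracy, 4))  # → 0.9881
```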
<p>The training takes around 9 seconds per epoch, and since we reach sufficient
real-world accuracy after about 25 epochs, it is done in a couple of minutes on
my laptop CPU. It seems that this problem is solved right there and the
captcha is already cracked. And for the 1.2 % of cases where we do not
solve the captcha on the first go, we just request a new one. With an
error rate that low we would not raise any suspicion.</p>
<h2 id="fiddling-with-the-network">Fiddling with the network</h2>
<p>Although there is no need to, we can investigate whether we could solve the
problem with fewer resources. To start with, I have just shrunk the dense
layer (which is part of the classifier) from 512 to 128 units. Also I have
stopped after 30 epochs. The results in the end are virtually the same; we just
happen to have a fluctuation in the 30th epoch, and in the 29th the validation
accuracy is very similar. Time has only gone down to 8 seconds per epoch, so
the smaller dense layer does not really make us much faster in the end.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-5.svg"></p>
<p>Next I have tried to remove one of the blocks consisting of convolution and
max-pooling layers. Interestingly the number of parameters in the model goes up
to 748,934. Time is still 8 seconds per epoch, so that does not change
anything. The resulting accuracy also seems to be quite okay, but it might need
a few more epochs to reach the level of the other runs.
Interestingly the fluctuations seem to be gone, which I take to be a good
sign. The model now has fewer layers, so perhaps it learns the features
more directly and does not learn so much from the noise.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-6.svg"></p>
<p>Perhaps adding some dropout will improve the system even more. Right after the
dense layer with 128 units I have added a dropout layer with a rate of 0.5.
It seems that it does not really improve anything.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-7.svg"></p>
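<p>For concreteness, a sketch of the dropout variant (assuming the TensorFlow-bundled Keras; the post imports plain <code>keras</code>) differs from the earlier network only in the classifier head:</p>

```python
from tensorflow import keras  # the post uses a plain `keras` import

def make_network_with_dropout(shape):
    # Same four convolutional blocks as before, but with a dropout
    # layer (rate 0.5) right after the 128-unit dense layer.
    network = keras.models.Sequential()
    network.add(keras.layers.Conv2D(32, (3, 3), activation='relu',
                                    input_shape=shape))
    network.add(keras.layers.MaxPooling2D((2, 2)))
    network.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
    network.add(keras.layers.MaxPooling2D((2, 2)))
    network.add(keras.layers.Conv2D(128, (3, 3), activation='relu'))
    network.add(keras.layers.MaxPooling2D((2, 2)))
    network.add(keras.layers.Conv2D(128, (3, 3), activation='relu'))
    network.add(keras.layers.MaxPooling2D((2, 2)))
    network.add(keras.layers.Flatten())
    network.add(keras.layers.Dense(128, activation='relu'))
    network.add(keras.layers.Dropout(0.5))
    network.add(keras.layers.Dense(6, activation='softmax'))
    return network

dropout_network = make_network_with_dropout((80, 54, 1))
```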
<p>The convolutional layers use 3×3 tiles. The max-pooling layer then takes a 2×2
tile and keeps its maximum. So basically we cover a 4×4 area with
one convolutional block, and with $n$ such blocks we cover a square with side
length $4^n = 2^{2n}$. The digits in the images are roughly 16 pixels in size,
so we should be able to get sufficient information with just two blocks. But
then the model has 1,789,190 parameters, which sounds like the wrong direction.
So I take two blocks, but give the last max-pooling layer a 5×5 pool size. I
already know that there is only one digit in the whole slice, so perhaps we can
pool over larger areas. This takes just 5 seconds per epoch to fit, but the
results are horrible.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-8.svg"></p>
<p>Apparently I am moving in the wrong direction. Although the features that I
want to detect are small, the image they lie on is rather large, so at
the moment the dense network does the locating of the digit. Let us
therefore go back to the four convolutional blocks. We are back at the model
with 290,310 parameters now.</p>
<p>I want to see what happens when I use less data. So instead of using 1000
images for training, let's just use 100. There are six digits, but with
rotation, so 100 images mean perhaps just 16 observations of each digit, which
might not be enough to cover all the rotations. The training is much faster now
as there is just 10 % of the data to process. Validation is still done with 500
images to keep the same precision there. The results are very poor compared to
before:</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-9.svg"></p>
<p>We can also see that there is overfitting: the accuracy on the training data
is much better than on the validation data. Perhaps we can use dropout
here to regularize the model and prevent the overfitting. But as one can see in
the following plot, the single dropout layer does not magically cure the fact
that there just is not enough data.</p>
<p><img alt="" src="https://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/history-10.svg"></p>
<h2 id="conclusions">Conclusions</h2>
<p>It is refreshingly simple to play around with deep neural networks using the
Keras library. For this blog entry I just fiddled around with it for a day,
after having read up on it in the book.</p>
<p>The captcha can be solved rather quickly; no wonder this type is not used
much any more these days.</p>
<p>One could likely improve the accuracy of this process further, but I just do
not see the point: for all practical purposes it is more than sufficient.</p>
<p>You can download the <a href="https://github.com/martin-ueding/simple-captcha-with-deep-neural-network">source
code</a>
from GitHub.</p></div>EnglishKerasMachine Learninghttps://martin-ueding.de/posts/simple-captcha-with-deep-neural-network/Thu, 21 May 2020 22:00:00 GMT