Machine Learning and Quantum Mechanics
Machine learning and quantum mechanics have nothing in common physically. However, they are based on very similar building blocks from mathematics.
As part of my physics studies, I also attended a bunch of math lectures. We learned about basic mathematical concepts like sets, fields, and how proofs work. We then had various lectures about linear algebra and multivariate calculus.
From the fourth semester onwards we covered quantum mechanics. There things stop being deterministic. The only deterministic thing is the evolution of probability distributions. So you can express the probabilities of outcomes with absolute precision; you just cannot say much about any particular event.
There are two ways to look at quantum mechanics. The first is the Schrödinger approach, which uses differential equations. One thinks of the wave function $\psi(x)$ as a function on position space, in general $\mathbb R^3$; for simplicity I will stick to one dimension here. The Schrödinger equation is then a differential equation. For the harmonic oscillator (a simple spring pendulum) it looks like this, in natural units with $\hbar = 1$: $$ E \psi(x) = -\frac{1}{2m} \psi''(x) + \frac{m \omega^2}{2} x^2 \psi(x) \,. $$
This is a differential equation in $\psi(x)$ that we can solve. In order to get a sensible solution, we need to add another constraint, namely that the function $\psi$ is normalizable, meaning that $$ N = \int_{-\infty}^\infty \mathrm dx \, |\psi(x)|^2 $$ is a finite number. We can then divide $\psi$ by $\sqrt N$ and have normalized the integral of the squared function to 1.
This constraint will cause the energy $E$ to be discrete. After solving the equation, one finds discrete energy levels $E_n = \omega \left( n + \tfrac12 \right)$, where $n$ is a non-negative integer.
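To see these levels without going through the analytic solution, here is a small numerical sketch of my own (not part of the original derivation), in natural units with $\hbar = m = \omega = 1$: we discretize the Hamiltonian on a grid and diagonalize it, and the lowest eigenvalues come out close to $n + 1/2$.

```python
import numpy as np

# Sketch of my own: finite-difference Hamiltonian of the harmonic oscillator,
# in natural units (hbar = m = omega = 1), diagonalized numerically.
n_points = 1000
x = np.linspace(-8.0, 8.0, n_points)
dx = x[1] - x[0]

# Kinetic term: -psi''(x) / 2 via a second-order finite difference.
kinetic = (
    np.diag(np.full(n_points, 1.0 / dx**2))
    - np.diag(np.full(n_points - 1, 0.5 / dx**2), k=1)
    - np.diag(np.full(n_points - 1, 0.5 / dx**2), k=-1)
)
# Potential term: m omega^2 x^2 / 2 on the diagonal.
potential = np.diag(0.5 * x**2)

energies = np.linalg.eigvalsh(kinetic + potential)
print(np.round(energies[:5], 3))  # approximately [0.5, 1.5, 2.5, 3.5, 4.5]
```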
Working out the analytic solution is pretty cumbersome; it takes a lot of fiddling with differential equations. The structure that comes out of it also doesn't really speak for itself at first. But there are hints that a bigger structure is at play.
The other view of quantum mechanics is that of linear algebra. The square-integrable functions, those with a finite normalization, actually form a vector space. The “vectors” of this vector space aren't the usual 3D vectors that we have seen in high school; they are functions. Functions satisfy all the axioms of a vector space, so we can have a vector space of functions. As we also have an inner product, this is a Hilbert space.
The connection to the functions is given by the definition of the inner product: $$ \langle \psi, \phi \rangle := \int_{-\infty}^\infty \mathrm dx \, \psi^*(x) \phi(x) \,. $$
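As a quick worked example of my own (the two functions are arbitrary, just Gaussians): on a grid, this integral turns into little more than a dot product, which already hints at the linear algebra view.

```python
import numpy as np

# Illustration of my own: the inner product of two functions, evaluated on a grid.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

psi = np.exp(-x**2 / 2)           # a Gaussian ...
phi = np.exp(-(x - 1.0)**2 / 2)   # ... and a shifted copy of it

# The integral becomes a dot product of the sampled functions, times dx.
inner = np.dot(np.conj(psi), phi) * dx
print(inner)  # analytic value: sqrt(pi) * exp(-1/4) ≈ 1.380
```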
Physicists usually write this inner product as $\langle \psi | \phi \rangle$ and then write $|\phi\rangle$ for the vectors and $\langle\phi|$ for covectors.
This linear algebra notation allows us to express a lot of things more concisely. The normalizability condition is just $\langle \psi | \psi \rangle < \infty$.
The solutions to the Schrödinger equation, the functions $\psi_n(x)$, are actually orthonormal. We write this as $\langle \psi_n | \psi_m \rangle = \delta_{mn}$, where $\delta_{mn}$ is one if both indices are the same and zero otherwise. This condition is cumbersome to write with the functions themselves; one has to spell it out as an overlap integral.
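To convince yourself of this numerically, here is a sketch of my own (natural units again) that uses the known closed form $\psi_n(x) = H_n(x)\, e^{-x^2/2} / \sqrt{2^n n! \sqrt\pi}$ with the physicists' Hermite polynomials $H_n$ and checks the overlap integrals:

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

# Sketch of my own: numerically check that the harmonic-oscillator eigenfunctions
# are orthonormal under the inner product defined above (natural units).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def psi(n, x):
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0  # select the physicists' Hermite polynomial H_n
    norm = math.sqrt(2**n * math.factorial(n) * math.sqrt(math.pi))
    return hermval(x, coeffs) * np.exp(-x**2 / 2) / norm

overlaps = np.array(
    [[np.dot(psi(m, x), psi(n, x)) * dx for n in range(4)] for m in range(4)]
)
print(np.round(overlaps, 6))  # approximately the 4x4 identity matrix, i.e. delta_mn
```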
These orthonormal functions form a basis of the Hilbert space, so every allowed state can be written as a linear combination of the solution functions. We can use all the linear algebra machinery to work with these solutions.
At the same time, the wave function serves as a probability density function. We have $p(x) = |\psi(x)|^2$ and can then use that to predict things. The expectation value of an observable $f(x)$ given a probability density $p(x)$ is always given as $$ \mathrm E[f] = \int \mathrm dx \, f(x) \, p(x) \,. $$ This holds in any field which uses statistics.
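As a small worked example of my own: for the harmonic-oscillator ground state $\psi_0(x) = \pi^{-1/4} e^{-x^2/2}$ (natural units), the expectation value of the observable $f(x) = x^2$ under $p(x) = |\psi_0(x)|^2$ comes out as $1/2$.

```python
import numpy as np

# Worked example of my own: expectation value of x^2 in the ground state.
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

psi0 = np.pi**-0.25 * np.exp(-x**2 / 2)  # ground state wave function
p = np.abs(psi0)**2                      # probability density p(x) = |psi(x)|^2

expectation = np.sum(x**2 * p) * dx      # E[x^2] = integral of x^2 p(x) dx
print(expectation)  # ≈ 0.5
```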
Deep learning
Now we can take a look at deep learning. At first glance, it is a completely different thing: we take models and fit them to data, and along the way they pick up features that seem helpful.
If we just take a look at dense layers (stacked together, they form a multi-layer perceptron), we find that these are just a weight matrix, a bias vector, and a non-linear activation function. The latter two act on each element individually; the matrix is in the realm of linear algebra.
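In code, such a layer is only a few lines. This is a minimal sketch of my own (with ReLU standing in for the activation function), not the implementation of any particular library:

```python
import numpy as np

# Sketch of my own: a single dense layer = weight matrix + bias vector + activation.
rng = np.random.default_rng(0)

W = rng.normal(size=(4, 3))  # weight matrix mapping 3 inputs to 4 outputs
b = np.zeros(4)              # bias vector

def relu(z):
    return np.maximum(z, 0.0)  # acts on each element individually

def dense(x):
    return relu(W @ x + b)     # W @ x is the linear algebra part

x = rng.normal(size=3)
print(dense(x))
```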
In order to train a model, one uses the gradient descent algorithm. We take the derivative of the loss function with respect to each parameter in the neural network. Because we have a chain of layers, we need to use the chain rule to pass the gradient through all the intermediate layers.
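Here is what that chain rule looks like when written out by hand for a tiny network with one hidden tanh layer and a squared-error loss; the whole example, including names like `W1` and `w2`, is my own illustration:

```python
import numpy as np

# Sketch of my own: the chain rule applied by hand through a tiny network.
rng = np.random.default_rng(0)

x = rng.normal(size=3)   # input
y = 1.0                  # target
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0

# Forward pass, keeping intermediate values for the backward pass.
z1 = W1 @ x + b1
h = np.tanh(z1)
y_hat = w2 @ h + b2
loss = 0.5 * (y_hat - y)**2

# Backward pass: each line is one application of the chain rule.
d_yhat = y_hat - y           # dL/dy_hat
d_w2 = d_yhat * h            # gradient for the output weights
d_b2 = d_yhat                # gradient for the output bias
d_h = d_yhat * w2            # gradient passed down to the hidden layer
d_z1 = d_h * (1.0 - h**2)    # through the tanh non-linearity
d_W1 = np.outer(d_z1, x)     # gradient for the hidden weights
d_b1 = d_z1                  # gradient for the hidden bias

print(loss, d_W1.shape, d_w2.shape)
```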
The “time evolution” of the system can be pictured as descending into the valley along the steepest path that we can find. Physically that corresponds to a ball rolling down a hill. It will continue rolling until it has found a local minimum.
In the mathematical description we have an iteration prescription, $w_{\text{new}} = w - \alpha \, \partial L / \partial w$, that can be read like a differential equation: $$ \dot w = - \alpha \, \frac{\partial L}{\partial w} \,, $$ where $L$ is the loss function.
We have this $\alpha$ called the learning rate, and it is a constant (at least if we are not using the Adam optimizer). If we looked at a physical system with a ball rolling down a hill, we would have something similar: the negative gradient is the force, and that creates an acceleration.
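As a toy example of my own, here is that update rule with a constant learning rate on a one-dimensional quadratic “hill”, $L(w) = (w - 3)^2$:

```python
# Toy example of my own: gradient descent with a constant learning rate
# on the loss L(w) = (w - 3)^2, whose valley lies at w = 3.
def loss_gradient(w):
    return 2.0 * (w - 3.0)

w = 0.0        # starting position of the "ball"
alpha = 0.1    # learning rate, kept constant

for step in range(50):
    w = w - alpha * loss_gradient(w)  # w_new = w - alpha * dL/dw

print(w)  # close to 3, the minimum of the loss
```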
Machine learning is an inherently statistical field, so one needs to work with probability densities and mass functions all the time. Calculating expectation values and variances is necessary to determine whether a system has a chance of converging at all.
Commonalities
When I started to look into machine learning around three years ago, I had a pretty easy start. Over the years I have realized just how much common ground quantum mechanics and deep learning have. There is a lot of linear algebra, a bit of multivariate calculus and lots of statistics to look at.
If one has learned quantum mechanics, it is rather straightforward to learn the technicalities of deep learning. Building an intuition requires hands-on experience but will come over time.
So if you're currently doing physics and want to look into deep learning, that field isn't really far away.