I want a GPU for Christmas!
March 17, 2021
February 20, 2024
The conditional gradient method seems to be an effective way to approximate a vector as a sum of atoms from a given atomic set. But what if we don't know the right atomic set for decomposing the vectors in a large dataset? This notebook explores the feasibility of using techniques from automatic differentiation and stochastic gradient descent to construct an atomic set.
Here’s a highly simplified setup that lets me experiment with tools from the Flux.jl ecosystem.
Let \(\{b_i\}_{i=1}^N\subset\mathbb{R}^m\) be the dataset we wish to decompose. We consider the optimization problem \[ \min_M\ \min_y\left\{\frac1N\sum_{i=1}^N\|y-b_i\|\ \mid \gamma_{M\mathbb{B}_1}(y)\le \tau\right\}, \] where \(M\) is an \(m\)-by-\(n\) matrix and \(\mathbb{B}_1=\{\pm e_i\}_{i=1}^n\subset\mathbb{R}^n\) is the set of signed canonical unit vectors. The mapped set \(M\mathbb{B}_1\) is then simply the set of signed columns of \(M\), and we see that the problem aims to compute the columns of \(M\) that give a good sparse representation of the vectors \(b_1,\ldots,b_N\).
If we assume that \(M\) is nonsingular, then the problem is equivalent to \[ \min_M\ \min_x\left\{ \frac1N\sum_{i=1}^N\|Mx-b_i\|\ \mid\ \gamma_{\mathbb{B}_1}(x)\le \tau\right\}. \]
In this first draft, I just want to get a handle on using the Flux tools, and so will entirely ignore the constraint. We can come back to that later.
import Pkg
Pkg.activate(".")
using HDF5, LinearAlgebra, Plots, Flux, Lazy, Distributions, Statistics
import Random: seed!
Activating project at `~/Sites/OPT-blog/posts/dictionary-learning`
Consider the least-squares loss \[ f_{M,b}(x):=\tfrac12\|Mx-b\|^2, \] where \(M\) and \(b\) are fixed. We can implement various generalizations of a descent method, where for each iterate \(x\), the negative gradient \(z=-\nabla f_{M,b}(x)\) is pushed through the subgradient map of a convex function that selects an aligning atom. Some examples are given below.
But first, here’s the descent method atomicdescent. The parameter depth controls the number of iterations.
function atomicdescent(ball, M, b; depth=10)
    x = zeros(length(M'b))              # coefficient vector
    r = copy(b); f = [0.5*norm(r)^2]    # residual and objective history
    for k = 1:depth
        a = ball'(M'r)                  # <--- many aligning choices! See below.
        Ma = M*a
        α = dot(Ma, r)/dot(Ma, Ma)      # <--- exact linesearch optional.
        α < 0 && break
        x = x + α*a
        r = r - α*Ma; f = [f; 0.5*norm(r)^2]
    end
    return 0.5*norm(r)^2, x, f
end;
We can implement various standard methods by choosing a corresponding function for the aligning function ball:
Here are the kernel function definitions.
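The original kernel code isn't reproduced here. As a minimal sketch (the names ball2 and ballinf are my own), two natural choices are the 2-norm, whose gradient recovers ordinary steepest descent, and the ∞-norm, which is the support function of \(\mathbb{B}_1=\{\pm e_i\}\) and whose subgradient selects a single signed coordinate, giving a matching-pursuit-style update. Zygote's ball'(z) syntax, loaded with Flux, supplies the gradients.

# Two possible aligning kernels (illustrative; the notebook's own definitions aren't shown).
# The gradient of the 2-norm is z/‖z‖, so atomicdescent reduces to steepest descent with exact linesearch.
ball2(z) = norm(z)
# The ∞-norm is the support function of {±eᵢ}; its subgradient is ±eₖ at a largest-magnitude
# entry, so each iteration updates a single coordinate.
ballinf(z) = maximum(abs.(z))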
Verify on a small random problem that the algorithm works as expected.
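A minimal sanity check, assuming the ballinf kernel sketched above: build a small random M, synthesize b from a sparse coefficient vector, and confirm that the objective values returned in f decrease.

# Small random test problem (illustrative; the notebook's own check isn't shown).
seed!(0)
m, n = 20, 50
M = randn(m, n)
xtrue = zeros(n); xtrue[rand(1:n, 3)] .= randn(3)   # sparse ground truth
b = M*xtrue
fval, x, f = atomicdescent(ballinf, M, b, depth=50)
@assert f[end] <= f[1]                              # exact linesearch never increases the objective
println("initial loss = $(f[1]), final loss = $fval")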
This experiment will try to learn the basis for a linear subspace. This is a pretty silly experiment since there are much more direct ways to do this. (See the Procrustes problem below.) But the main aim here is to test that we can actually differentiate through atomicdescent and apply the stochastic gradient method.
Let the columns of the \(m\)-by-\(k\) orthonormal matrix \(Q\) span this subspace. Then we can generate samples b from this space via \[ b = Qv \] by sampling normalized Gaussian vectors \(v\).
Define the Subspace type, a constructor that produces a \(k\)-dimensional linear subspace embedded in \(\mathbb{R}^m\). Make it callable.
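The original definition isn't shown; here is a minimal sketch under the assumptions above: the field Q stores an m-by-k orthonormal basis, and calling an instance draws one sample b = Qv.

# A possible Subspace type (illustrative).
struct Subspace
    Q::Matrix{Float64}                  # m-by-k orthonormal basis
end
# Construct a random k-dimensional subspace of ℝᵐ from a Gaussian matrix via the thin SVD.
Subspace(m::Integer, k::Integer) = Subspace(svd(randn(m, k)).U)
# Make the type callable: each call returns a sample b = Qv with v a normalized Gaussian vector.
(S::Subspace)() = S.Q * normalize(randn(size(S.Q, 2)))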
We define three loss functions that will be used for the training:

- loss(M, b) computes the squared Euclidean distance between the vector b and the range of M.
- loss(M) computes the empirical loss between random samples drawn from Subspace (defined above) and the range of M.
- atomicloss(M, b) returns the loss computed by atomicdescent for a given M and b.

Apply stochastic gradient descent to compute an approximation to the subspace. We'll use this small random data set.
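The notebook's own definitions aren't reproduced here; the sketch below is one plausible version, where S is a Subspace instance, B is the small random data set, and ball2 is the kernel sketched earlier. The names and sizes are assumptions.

# Illustrative losses and data set.
S = Subspace(20, 4)                               # a 4-dimensional subspace of ℝ²⁰
B = [S() for _ in 1:100]                          # small random data set
# Squared distance from b to the range of M, via the least-squares projection residual.
loss(M, b) = 0.5*norm(M*(M\b) - b)^2
# Empirical loss over fresh random samples drawn from the subspace.
loss(M; nsamples=25) = mean(loss(M, S()) for _ in 1:nsamples)
# Loss reported by atomicdescent (its first return value) for a given M and b.
atomicloss(M, b) = atomicdescent(ball2, M, b, depth=20)[1]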
Train the model.
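A bare-bones sketch of such a loop, assuming the names above; the step size and iteration count are arbitrary, and whether Zygote can differentiate through the nested ball2' call inside atomicdescent is precisely what this experiment probes.

# Stochastic gradient descent on atomicloss (illustrative training loop).
seed!(1)
M = randn(20, 4)                        # initial guess for the dictionary
η = 0.1                                 # step size
history = Float64[]
for epoch in 1:200
    b = B[rand(1:length(B))]            # draw one sample from the data set
    g = gradient(m -> atomicloss(m, b), M)[1]
    M .-= η .* g
    push!(history, loss(M))             # monitor the empirical loss
end
plot(history, xlabel="iteration", ylabel="empirical loss", legend=false)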
The subspace learning example above is a toy, and the atomicdescent implementation won't allow for ball kernels defined on the extended reals, such as the gauge to the simplex, which would allow for nonnegativity.
Some things to try:
The Swimmer dataset (Donoho and Stodden 2003) could be a good test set.
swimmers = h5open("Swimmer.h5","r") do dataset
    s = convert(Array{Float32,3}, read(dataset)["Y"])
    # shift the configuration index to the end (needed for Flux)
    s = permutedims(s, (2,3,1))
    # scale the pixel values to [0, 1]
    s /= 255
    # rescale by the overall maximum intensity
    maxper = maximum(reshape(s, prod(size(s)[1:2]),:))
    return broadcast(/, s, maxper)
end
println("The set contains $(size(swimmers)[3]) swimmer configurations, each of size $(size(swimmers)[1:2]).")
The set contains 256 swimmer configurations, each of size (32, 32).
Here are the first 25 swimmer configurations in the dataset.
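The figure itself isn't reproduced; one possible way to generate it with Plots is sketched below.

# Show the first 25 configurations as a 5-by-5 grid of grayscale heatmaps (illustrative).
panels = [heatmap(swimmers[:, :, i]; c=:grays, yflip=true,
                  ticks=false, framestyle=:none, colorbar=false) for i in 1:25]
plot(panels...; layout=(5, 5), size=(700, 700))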
For reference, the orthogonal Procrustes problem mentioned above seeks the orthogonal matrix \[ Q = \mathop{\mathrm{arg\,min}}\left\{\,\|QA-B\|_F \mid Q^TQ=I\,\right\}, \] which has a closed-form solution via the singular value decomposition of \(BA^T\).
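Here is a hypothetical helper that computes this solution by the standard SVD formula; it is not part of the original notebook.

# Orthogonal Procrustes via the SVD (illustrative helper).
function procrustes(A, B)
    F = svd(B*A')                       # B*A' = UΣVᵀ
    return F.U*F.Vt                     # Q = UVᵀ minimizes ‖QA − B‖_F over orthogonal Q
end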