Reimplementing a gauge equivariant icosahedral CNN for spherical images

Hailey
7 min read · Aug 29, 2019

Introduction

Traditional convolutional neural networks (CNNs) operate on rectangular images, and their success stems from exploiting the translational symmetry of the flat plane on which the images lie: if an input feature is translated, the corresponding output features are translated likewise, which is achieved by sliding the same filters across the input. For images on other manifolds, we want a similar CNN that exploits the symmetries of those spaces. Cohen et al. [1] propose a new kind of CNN based on gauge theory from physics and implement it on an icosahedron approximating a sphere. Following their paper, we reimplement a neural network for classifying digits on images from the MNIST dataset projected onto the sphere.

The basic idea of gauge theory is that the measurement of an intrinsic physical quantity should be independent of the coordinates used to take it. So if you change the coordinates, the only difference that shows up in the measurement should be a change corresponding to the change of coordinates. A gauge transform is a change of the coordinate basis of the tangent space at each point of a manifold, and gauge equivariance means that measurements change according to the gauge transform. For example, if you measure the velocity of a particle with respect to a basis of three unit vectors, the measurement is a coordinate vector; if you change the basis, say by rotating it or using a different unit of measurement, you get a different coordinate vector, but it still expresses the same velocity of the particle. The collection of these coordinate transformations forms a group, an algebraic object used to study symmetries. A gauge equivariant CNN takes such transformations of the input features into account and produces features that transform accordingly, by using filters with a built-in symmetry.

Preparing the dataset

For each image in the MNIST dataset, we apply a random 3D rotation and project it onto the sphere from the vector perpendicular to the rotation plane. The next step is to convert this projection into a format that is easy for a computer to handle. Cohen et al. [1] approximate the sphere by an icosahedron, a regular polyhedron with 20 triangular faces. Of the five Platonic solids, the icosahedron is the closest to the sphere in the sense that, when projected to the sphere from the center, its metric is least distorted. (There is still a notable difference: the sphere is positively curved, while the icosahedron is flat everywhere except at the vertices, called singular points, where positive curvature is concentrated.)
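
A uniformly random 3D rotation can be sampled in a few lines; one standard recipe (a sketch, not necessarily the one used in our code) takes the QR decomposition of a Gaussian matrix and fixes the signs so the result is Haar-distributed with determinant +1:

```python
import numpy as np

def random_rotation(rng):
    """Sample a uniformly random 3D rotation matrix via the QR
    decomposition of a Gaussian matrix, with sign corrections."""
    A = rng.normal(size=(3, 3))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))   # fix the sign ambiguity of QR columns
    if np.linalg.det(Q) < 0:   # restrict to proper rotations (det = +1)
        Q[:, 0] = -Q[:, 0]
    return Q
```

The sampled matrix is then applied to the sphere grid points before interpolating the image values.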

The icosahedron is then covered with five slightly overlapping charts, each of which is a parallelogram made up of 4 triangular faces with thin strips attached to two of its sides. Cohen et al. put a grid on the icosahedron by successively subdividing each triangular face into 4 smaller triangles. We want more flexibility in the number of grid points, so we instead subdivide each side of a triangular face into n subintervals, obtaining a total of 10 n² + 10 grid points on the icosahedron. (We use n = 10.) Unlike flat triangles, spherical triangles cannot be divided into smaller congruent triangles, because the sphere admits no scaling transformations as Euclidean space does. So the smaller triangles obtained above are not identically shaped when projected to the sphere: the ones closer to the center of a face are larger than the ones closer to its edges. Nonetheless, the grid points are roughly uniformly distributed on the sphere and not too hard to compute. We obtain the values of the projected images at these grid points by linear interpolation and store them in a rectangular array.
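
The interpolation step can be sketched as ordinary bilinear interpolation of the source image at the fractional pixel coordinates where each grid point lands (a minimal illustration; the actual code also has to handle the spherical projection that produces those coordinates):

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinearly interpolate a 2D image at fractional pixel (x, y),
    clamping at the image border."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    dx, dy = x - x0, y - y0
    x1 = min(x0 + 1, img.shape[1] - 1)
    y1 = min(y0 + 1, img.shape[0] - 1)
    return ((1 - dx) * (1 - dy) * img[y0, x0] + dx * (1 - dy) * img[y0, x1]
            + (1 - dx) * dy * img[y1, x0] + dx * dy * img[y1, x1])
```

Evaluating this at every grid point of every chart fills the rectangular array of input features.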

An icosahedron with one face subdivided into 100 smaller triangles
A chart (blue) on the sphere made up of 4 triangles and 2 strips (shown as curves) that overlap with another chart (green)

Constructing the gauge equivariant CNN

Now we move on to constructing the gauge equivariant CNN. Each grid point has 6 neighbors, except for the 12 vertices of the icosahedron, called corners in [1], each of which has 5. The sliding window we use for the CNN is a hexagonal 1-ring, consisting of a point at the center and its 6 neighbors, embedded into a 3 × 3 array; the filters are functions defined on this array. The 12 corners are ignored, an effect that spreads to nearby grid points down the layers, but this can be remedied by using a denser grid.

The effect of ignoring corners on the CNN. We use 2 grids, the left one with 10 subdivisions and the right one with 20. We put a constant input feature of 1 on them, and use a filter that takes the average of values at each point and its hexagonal 1-ring. We plot the input and the output after 50 layers of convolutions side-by-side. Ideally, there should not be any difference between the input and the output.

Every feature in the network is either a scalar- or a (multi-dimensional) vector-valued field. A traditional CNN need not deal with rotational symmetry, so a scalar feature suffices, and a vector feature is nothing but a stack of different scalar features. For the icosahedron, however, we need to handle the local symmetry of the grid points generated by an order-6 rotation, and vector features are better suited to the task. To see this, suppose a feature consists of the differences between the value at a grid point and the values at its 6 neighbors, giving six values for that grid point. If we ignore the rotation, we can store them as six different scalar features; but if we take the rotation into account, it is better to store them as a 6-dimensional vector feature and cyclically permute its coordinates when the 6 neighbors are rotated. Except for the input feature, which is scalar by nature, the features in the gauge equivariant CNN are all 6-dimensional vector features, where each dimension corresponds to an element of the local symmetry group. Such vector features are called regular features, a term which comes from the regular representation in the mathematical theory of group representations.
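
Concretely, a rotation by a multiple of 60° acts on a regular feature by rolling its coordinates. A minimal sketch (the roll direction is an assumed convention, not the paper's exact indexing):

```python
import numpy as np

# A regular feature at one grid point: six values listed in cyclic
# order around the hexagonal 1-ring.
feature = np.array([0.3, -1.2, 0.8, 0.1, -0.5, 2.0])

def rotate_regular(v, k):
    """Apply a rotation by k * 60 degrees to a regular feature: the
    neighbors are relabelled cyclically, so the coordinates roll."""
    return np.roll(v, k)
```

Rotating six times is the identity, as expected of an order-6 symmetry.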

Padding

To keep the size of the underlying grid unchanged across features, we need to pad the charts. In a traditional CNN, images are simply zero-padded, as there is nothing outside the rectangular region; here, however, we must pad the features with features from adjacent grid points on the icosahedron, which form the 2 strips in the charts. Since we are crossing from one chart to another, we cannot simply copy the regular features from these grid points, because they are expressed with respect to a different basis, i.e. a different gauge. So we transform the regular features before copying, which amounts to a cyclic permutation of their coordinates.
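
The gauge transform applied during padding can be sketched as a roll over the orientation axis of the copied strip (the sign and size of the shift between any two charts is an assumed convention here; the real code looks it up per chart pair):

```python
import numpy as np

def pad_from_neighbour(strip, shift):
    """Copy a boundary strip of regular features from an adjacent chart,
    applying the gauge transform between the two charts: a cyclic
    permutation of the 6 orientation channels. `strip` has shape
    (..., 6); `shift` is the relative rotation between the two gauges."""
    return np.roll(strip, shift, axis=-1)
```

Composing the transform for one chart transition with its inverse transition must give back the original features, which is a useful sanity check when wiring up the atlas.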

Filter expansion for gauge equivariance

To preserve symmetry in the output, on top of sliding the same filters across the image, we also need to make sure that when the input is rotated, the output is rotated likewise. This places a symmetry constraint on the filter, and the filter is expanded to satisfy this constraint. We hardcode the filter expansion in the program, one for scalar-to-regular and one for regular-to-regular filters, following the outlines in the paper.
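
The scalar-to-regular case can be sketched as follows, under assumed conventions: the hexagonal 1-ring is embedded in a 3 × 3 array with the center at (1, 1), the six neighbors listed in cyclic 60° order, and the two remaining corners unused (the paper's code may use a different layout). Each rotated copy of the base kernel produces one orientation channel of the regular output:

```python
import numpy as np

CENTER = (1, 1)
# Hypothetical cyclic ordering of the 6 neighbors inside the 3x3 array;
# positions (0, 2) and (2, 0) are the unused corners.
NEIGHBOURS = [(0, 0), (0, 1), (1, 2), (2, 2), (2, 1), (1, 0)]

def rotate_kernel(w, k):
    """Rotate a 3x3 hexagonal kernel by k * 60 degrees: the center stays
    put and each neighbor weight moves k steps along the cycle."""
    out = np.zeros_like(w)
    out[CENTER] = w[CENTER]
    for i, pos in enumerate(NEIGHBOURS):
        out[NEIGHBOURS[(i + k) % 6]] = w[pos]
    return out

def expand_scalar_to_regular(w):
    """Expand a scalar-to-regular weight tensor of shape
    (C_out, C_in, 3, 3) into (C_out * 6, C_in, 3, 3): one spatially
    rotated copy of each kernel per element of the rotation group."""
    copies = [np.stack([[rotate_kernel(w[o, i], k)
                         for i in range(w.shape[1])]
                        for o in range(w.shape[0])])
              for k in range(6)]
    return np.stack(copies, axis=1).reshape(-1, w.shape[1], 3, 3)
```

The regular-to-regular expansion additionally rolls the input orientation axis of the kernel by the same k, so that the rotated filter reads the rotated input channels in the matching order.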

Now we can feed the expanded filters into the usual CNN, which takes care of the transitive symmetry of the sphere through the “translational symmetry” of the plane. After the convolution is performed, the values at the chart boundaries of the output features are padded again in the same way as above, since even in the interior of the atlas the convolution uses wrong neighboring vertices at the chart boundaries.

Augmenting the dataset by icosahedral symmetries

To let our CNN learn all 60 (global) symmetries of the icosahedron, we expand our training set by these symmetries. We hardcode 3 symmetries in terms of indices of the atlas, namely rotations about a vertex, about an edge, and about a triangular face, and express the other symmetries as words in these 3 generators. It is easy to see that these 3 symmetries generate the whole symmetry group. We write a function to find the words.
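
Finding words in the generators is a breadth-first search over compositions. A self-contained sketch, demonstrated on the 6-element symmetry group of a triangle rather than the full icosahedral group (the same routine applies once the 3 hardcoded symmetries are encoded as permutations of the grid indices):

```python
from collections import deque

def compose(p, q):
    """Compose two permutations given as tuples: apply p first, then q."""
    return tuple(q[i] for i in p)

def enumerate_words(generators):
    """BFS over words in the generators; returns a shortest word
    (list of generator names) for every reachable group element."""
    n = len(next(iter(generators.values())))
    identity = tuple(range(n))
    words = {identity: []}
    queue = deque([identity])
    while queue:
        g = queue.popleft()
        for name, s in generators.items():
            h = compose(g, s)
            if h not in words:
                words[h] = words[g] + [name]
                queue.append(h)
    return words

# Toy demo: a 3-cycle and a transposition generate all 6 symmetries
# of a triangle.
words = enumerate_words({"r": (1, 2, 0), "f": (0, 2, 1)})
```

For the icosahedron, the search terminates with exactly 60 elements, confirming that the 3 hardcoded symmetries do generate the group.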

Transform an image (each shown as an atlas) by 60 symmetries of icosahedron
The original image on the sphere (flattened)

CNN training and testing result

We use a smaller network than the one in the paper, since our focus is not on exactly reproducing the paper’s results but on learning how the gauge equivariant CNN works. Our CNN has 3 convolution layers and 2 fully connected layers. The first convolution is scalar-to-regular and the next two are regular-to-regular, with 4, 8, and 10 output channels respectively and stride 1. The fully connected layers have 200 and 10 channels respectively. We train our CNN on 40 000 projected images from the MNIST training set and test it on 10 000 held-out projected images, also drawn from the MNIST training set. We run 60 epochs, and in each epoch we transform the training data by one of the 60 icosahedral symmetries. We get a fair performance of about 88.6% accuracy.

[1] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling. Gauge equivariant convolutional networks and the icosahedral CNN. arXiv:1902.04615, 2019.
