Title
Authors
Laurens van der Maaten, Geoffrey Hinton
Summary
Introduction
Most traditional visualization techniques merely provide tools to display more than 2 data dimensions, and leave the interpretation of the data to the human observer.
Dimensionality reduction methods convert the high-dimensional data $\mathcal{X} = \{x_1, \ldots, x_n\}$, where $x_i \in \mathbb{R}^m$, into 2- or 3-dimensional data $\mathcal{Y} = \{y_1, \ldots, y_n\}$ that can be displayed in a scatterplot. We refer to $\mathcal{Y}$ as a map, and to the $y_i$ as map points.
Aim: preserve as much of the significant structure of the high-dim data as possible in the low-dim map.
- linear mapping (PCA, MDS, …)
- nonlinear mapping
- aim to preserve the local structure: Sammon mapping; curvilinear components analysis (CCA); Stochastic Neighbor Embedding (SNE); Isomap; Maximum Variance Unfolding (MVU); Locally Linear Embedding (LLE); Laplacian Eigenmaps
- retaining both the local and the global structure of the data in a single map: t-SNE
Stochastic Neighbor Embedding (SNE)
SNE starts by converting the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities [a kind of “confusion data”]. The similarity of data point $x_j$ to $x_i$ is the conditional probability $p_{j|i}$ that $x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.
In the high-dimensional space, the conditional probability is given by $p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^{2} / 2\sigma_i^{2}\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^{2} / 2\sigma_i^{2}\right)}$, where $\sigma_i$ is the variance of the Gaussian that is centered on $x_i$; how $\sigma_i$ is determined is described below. We set $p_{i|i} = 0$.
In the low-dimensional map, the variance of the Gaussian used for $q_{j|i}$ is set to $\frac{1}{\sqrt{2}}$, so that $q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^{2}\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^{2}\right)}$, with $q_{i|i} = 0$.
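As a concrete illustration, here is a minimal NumPy sketch of the high-dimensional conditional probabilities (not the authors' code; `X` is an $n \times m$ data matrix and `sigmas` holds the per-point bandwidths $\sigma_i$, both assumed given, and the function name is my own):

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """Compute the SNE conditional probabilities p_{j|i}.

    X      : (n, m) array of high-dimensional points.
    sigmas : (n,) array of per-point Gaussian bandwidths sigma_i.
    Returns an (n, n) array P with P[i, j] = p_{j|i} and P[i, i] = 0.
    """
    # Squared Euclidean distances ||x_i - x_j||^2.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Unnormalized Gaussian affinities centered at each x_i (row i uses sigma_i).
    affinities = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(affinities, 0.0)          # p_{i|i} = 0
    return affinities / affinities.sum(axis=1, keepdims=True)
```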
If the map points $y_i, y_j$ correctly model the similarity between $x_i, x_j$, then $p_{j|i} = q_{j|i}$. Thus, SNE aims to find a low-dimensional data representation that minimizes the mismatch between $p_{j|i}$ and $q_{j|i}$.
A natural measure of this mismatch is the Kullback-Leibler divergence (which in this case equals the cross-entropy up to an additive constant). SNE minimizes the sum of the KL divergences over all data points using a gradient descent method. The cost function is $C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$, where $P_i$ and $Q_i$ denote the conditional probability distributions over all other points given $x_i$ and $y_i$, respectively.
Because the KL divergence is not symmetric, different types of errors in the pairwise distances in the low-dimensional map are not weighted equally: there is a large cost for using a small $q_{j|i}$ to model a large $p_{j|i}$, but only a small cost for using a large $q_{j|i}$ to model a small $p_{j|i}$. The SNE cost function therefore focuses on retaining the local structure of the data in the map.
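As a quick numeric illustration (the values are chosen only for this example): modeling $p_{j|i} = 0.8$ with $q_{j|i} = 0.1$ contributes $0.8 \log(0.8/0.1) \approx 1.66$ to the cost, whereas modeling $p_{j|i} = 0.1$ with $q_{j|i} = 0.8$ contributes $0.1 \log(0.1/0.8) \approx -0.21$, which is small in magnitude (the total KL divergence is still non-negative once all terms are summed).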
$\sigma_i$ parameter
It is unlikely that there is a single value of $\sigma_i$ that is optimal for all data points in the data set, because the density of the data is likely to vary. The distribution $P_i$ induced by $\sigma_i$ has an entropy that increases as $\sigma_i$ increases.
SNE performs a binary search for the value of $\sigma_i$ that produces a $P_i$ with a fixed perplexity specified by the user: $\mathrm{Perp}(P_i) = 2^{H(P_i)}$, where the Shannon entropy is $H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$.
perplexity can be interpreted as a smoothed measure of the effective number of neighbors. SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.
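A hedged sketch of this per-point binary search (the tolerance, bracketing bounds, and iteration cap are my own illustrative choices, not values from the paper):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary-search sigma_i so that Perp(P_i) = 2^{H(P_i)} matches the target.

    sq_dists_i : (n-1,) squared distances from x_i to every other point.
    """
    lo, hi = 1e-20, 1e20
    sigma = 1.0
    for _ in range(max_iter):
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))   # Shannon entropy H(P_i)
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:          # distribution too flat -> shrink sigma
            hi = sigma
            sigma = (lo + sigma) / 2.0
        else:                                       # distribution too peaked -> grow sigma
            lo = sigma
            sigma = (sigma + hi) / 2.0 if hi < 1e20 else sigma * 2.0
    return sigma
```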
optimization problem
The minimization of the cost function is performed using a gradient descent method. The gradient is $\frac{\delta C}{\delta y_i} = 2 \sum_j \left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)\left(y_i - y_j\right)$, and may be interpreted as the resultant force created by a set of springs between the map point $y_i$ and all other map points $y_j$. Each spring exerts a force along the direction $y_i - y_j$, and repels or attracts the map points depending on whether the distance in the map is too small or too large to represent the similarity between the two high-dimensional points. The force exerted by a spring is also proportional to its stiffness, which is the mismatch $\left(p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}\right)$ between the pairwise similarities of the data points and the map points.
Gradient descent
initialized by sampling map points randomly from an isotropic Gaussian with small variance that is centered around the origin.
In order to speed up the optimization and avoid poor local minima, a relatively large momentum term is added to the gradient, that is, the current gradient is added to an exponentially decaying sum of previous gradients in order to determine the changes in the coordinates of the map points at each iteration of the gradient search.
$\mathcal{Y}^{(t)}=\mathcal{Y}^{(t-1)}+\eta \frac{\delta C}{\delta \mathcal{Y}}+\alpha(t)\left(\mathcal{Y}^{(t-1)}-\mathcal{Y}^{(t-2)}\right)$ , where $\mathcal{Y}^{(t)}$ indicates the solution at iteration $t$, $\eta$: learning rate, $\alpha(t)$ represents the momentum at iteration $t$.
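A minimal sketch of this update rule (a generic loop; `grad_fn` stands in for the SNE gradient above, and the learning rate and the 0.5 → 0.8 momentum switch are illustrative assumptions rather than prescribed settings):

```python
import numpy as np

def gradient_descent(grad_fn, n, n_iter=1000, learning_rate=100.0, dim=2):
    """Momentum gradient descent on the map Y (illustrative settings).

    grad_fn : callable that returns dC/dY for the current map, shape (n, dim).
    """
    rng = np.random.default_rng(0)
    Y = rng.normal(scale=1e-4, size=(n, dim))    # small isotropic Gaussian initialization
    Y_prev = Y.copy()
    for t in range(n_iter):
        alpha = 0.5 if t < 250 else 0.8          # momentum schedule (assumed)
        grad = grad_fn(Y)
        # Step against the gradient to minimize C, plus the momentum term.
        Y_new = Y - learning_rate * grad + alpha * (Y - Y_prev)
        Y_prev, Y = Y, Y_new
    return Y
```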
In the early stages, Gaussian noise is added to the map points after each iteration (this annealing noise is not included in the sketch above); gradually reducing the variance of this noise helps to escape from poor local minima in the cost function.
However, tuning the variance of the noise requires the optimization to be run several times to find appropriate values. In this respect, SNE is inferior to methods that allow convex optimization and do not require the extra computation time and parameter choices introduced by simulated annealing.
t-Distributed SNE
Symmetric SNE
Replace the KL divergences between conditional probabilities with a single KL divergence between joint probability distributions $P$ and $Q$.
$C = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$, where $p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^{2} / 2\sigma^{2}\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^{2} / 2\sigma^{2}\right)}$.
In the low-dimensional map we define $q_{ij} = \frac{\exp\left(-\|y_i - y_j\|^{2}\right)}{\sum_{k \neq l} \exp\left(-\|y_k - y_l\|^{2}\right)}$. However, defining $p_{ij}$ as above causes problems when a high-dimensional point $x_i$ is an outlier (i.e., all pairwise distances $\|x_i - x_j\|^{2}$ are large, so $p_{ij}$ is extremely small for all $j$; then $y_i$ has very little effect on the cost function and the map point is not well determined).
This problem is solved by defining the joint probabilities $p_{ij}$ in the high-dimensional space as the symmetrized conditional probabilities, $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$. This also ensures that $\sum_j p_{ij} > \frac{1}{2n}$ for all $x_i$, so each $x_i$ makes a significant contribution to the cost function.
The gradient of symmetric SNE is $\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)$.
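A short continuation of the earlier sketch for the symmetrized joint probabilities (reusing the hypothetical `conditional_probabilities` helper defined above):

```python
def joint_probabilities(X, sigmas):
    """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    P_cond = conditional_probabilities(X, sigmas)   # rows sum to 1, zero diagonal
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)          # sums to 1 over all pairs i, j
```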
In preliminary experiments, symmetric SNE seems to produce maps that are just as good as asymmetric SNE, and sometimes even a little better.
The Crowding Problem
In 10 dimensions, it is possible to have 11 data points that are mutually equidistant, and there is no way to model this faithfully in a 2-dim map.
crowding problem: the area of the 2-dim map that is available to accommodate moderately distant datapoints will not be nearly large enough compared with the area available to accommodate nearby datapoints.
Hence, most of the points that are at a moderate distance from datapoint $i$ will have to be placed much too far away in the 2-dim map.
crowding problem occurs in local techniques for multidimensional scaling such as SNE and Sammon mapping.
UNI-SNE attempts to address this by adding a slight repulsion to all springs, introducing a uniform background model with a small mixing proportion $\rho$. The best optimization method known for UNI-SNE is to start by setting $\rho = 0$ and to increase it later, which allows some gaps to form between natural clusters. However, if two parts of a cluster get separated early on in the optimization, there is no force to pull them back together.
t-SNE
Mismatched tails can compensate for mismatched dimensionalities
In the high-dimensional space, distances are converted into probabilities using a Gaussian distribution, while in the low-dimensional map a probability distribution with much heavier tails than a Gaussian is used. This eliminates the unwanted attractive forces between map points that represent moderately dissimilar datapoints.
t-SNE employs a Student t-distribution with one degree of freedom (which is the same as a Cauchy distribution) as the heavy-tailed distribution in the low-dimensional map.
$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^{2}\right)^{-1}}{\sum_{k \neq l}\left(1 + \|y_k - y_l\|^{2}\right)^{-1}}$
This has the nice property that $\left(1 + \|y_i - y_j\|^{2}\right)^{-1}$ approaches an inverse square law for large pairwise distances $\|y_i - y_j\|^{2}$ in the low-dimensional map. This makes the map's representation of joint probabilities (almost) invariant to changes in the scale of the map for map points that are far apart.
The Student t-distribution is an infinite mixture of Gaussians.
Computationally convenient: it is much faster to evaluate the density under a t-distribution than under a Gaussian, because it does not involve an exponential.
It also makes the embedding more robust, reduces the influence of outliers, and captures global features better. [from Zhihu]
gradient of the KL divergence: $\frac{\delta C}{\delta y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \|y_i - y_j\|^{2}\right)^{-1}$, as derived in the appendix of the original paper.
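A minimal NumPy sketch of the Student-t joint probabilities and this gradient (assuming `P` is the symmetrized high-dimensional joint probability matrix from the earlier sketch and `Y` the current map):

```python
import numpy as np

def tsne_gradient(P, Y):
    """Student-t (1 d.o.f.) joint probabilities Q and the t-SNE gradient dC/dY.

    P : (n, n) symmetrized high-dimensional joint probabilities.
    Y : (n, 2) current low-dimensional map.
    """
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_dists)            # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                     # q_ij normalized over all pairs, q_ii = 0
    # dC/dy_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
    PQ = (P - Q) * inv
    return 4.0 * (np.diag(PQ.sum(axis=1)) @ Y - PQ @ Y)
```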
Two main advantages
As shown in the gradient figure in the original paper:
- The t-SNE gradient strongly repels dissimilar datapoints that are modeled by a small pairwise distance in the low-dimensional map.
- These repulsions do not go to infinity, so dissimilar datapoints are not pushed much too far away from each other.
- The optimization of the t-SNE cost function is much easier than that of SNE or UNI-SNE: t-SNE introduces long-range forces in the low-dimensional map that can pull back together similar points that get separated early on, so good local optima can be identified without resorting to the simulated annealing that SNE and UNI-SNE need to obtain reasonable solutions.
Optimization for t-SNE
gradient descent:
can be sped up using an adaptive learning rate scheme that gradually increases the learning rate in directions in which the gradient is stable.
can be improved further by using either of 2 tricks (both are sketched in code after this list):
- early compression
adding an additional $L_2$-penalty to the cost function that is proportional to the sum of squared distances of the map points from the origin.
- early exaggeration
Multiply all of the $p_{ij}$ by, for example, 4.
effect: the natural clusters in the data tend to form tight, widely separated clusters in the map, which creates a lot of relatively empty space and makes it easier to move the clusters around to find a good global organization.
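A hedged sketch of how the two tricks could be wired into the gradient-descent loop sketched earlier (the factor 4 comes from the notes above; the iteration cutoffs and the penalty weight are my own illustrative choices):

```python
def optimization_tricks(P, Y, t, exaggeration=4.0, stop_lying_iter=50,
                        compression=0.01, stop_compress_iter=100):
    """Early exaggeration and early compression for the descent loop above."""
    # Early exaggeration: inflate all p_ij during the first iterations so that
    # the natural clusters form tight, widely separated groups early on.
    P_eff = P * exaggeration if t < stop_lying_iter else P
    # Early compression: gradient of the L2 penalty c * sum_i ||y_i||^2,
    # which keeps the map points close to the origin in the early stages.
    penalty_grad = 2.0 * compression * Y if t < stop_compress_iter else 0.0
    return P_eff, penalty_grad
```

Inside the loop, the t-SNE gradient would be computed from `P_eff`, and `penalty_grad` would simply be added to it before the momentum update.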
Experiments
compare t-SNE with 7 other non-parametric techniques for dimensionality reduction.
setup
In all experiments, start by using PCA to reduce the dimensionality to 30. This speeds up the computation of pairwise distances and suppresses some noise without severely distorting the interpoint distances. Then use the visualization techniques to convert the 30-dimensional representation to a 2-dimensional scatterplot.
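To reproduce this setup quickly, one might use scikit-learn's `PCA` and `TSNE` (a sketch with placeholder data; scikit-learn's t-SNE is an independent implementation, and the parameter values here are illustrative rather than the paper's exact settings):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder data standing in for a real data set (e.g. flattened images).
X = np.random.rand(1000, 784)

# PCA to 30 dimensions: faster pairwise distances, some noise suppressed.
X_30 = PCA(n_components=30).fit_transform(X)

# t-SNE down to 2 dimensions for a scatterplot; perplexity 30 is a common choice.
Y = TSNE(n_components=2, perplexity=30).fit_transform(X_30)
```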
result
t-SNE produces significantly better visualizations than the other techniques: the maps show well-separated clusters that reflect the natural classes in the data.
Applying t-SNE to large data sets
Discussion
Conclusion
Reference
Laurens van der Maaten, Geoffrey Hinton. Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.