Spectrahedron Manifold Word Embeddings

The multi-sense information of a polysemous word is well conveyed through the geometry of the Spectrahedron manifold.

Multi-sense embedding structure

Unsupervised global word embedding models like GloVe and Word2Vec map each word to a single point in a vector space. As a result, all possible meanings of a polysemous word present in the corpus are conflated into a single representation. This phenomenon, termed meaning conflation deficiency, causes unwanted semantic modelling behaviour such as pulling two seemingly unrelated words close together in the latent space. Take the two words “rat” and “screen”: they are pulled towards each other in the embedding space because both are synonymous with the word “mouse” under two different senses. If no distinction is made between the senses, the accuracy of semantic modelling suffers. One popular solution to this problem is sense representation, i.e. each individual sense of a word present in the corpus is encoded with an independent vector embedding.

My goal is to find a well thought out word representation model that tackles this multi-sense problem by having a single mode of representation for each polysemous word, from which a computationally efficient function produces a single vector representation for each sense. One possible way to do this is to encode words as points on a Grassmannian manifold, which are essentially \(r\)-dimensional subspaces for some \(r \in \mathbb{N}\). Points on a Grassmannian manifold are subspaces that can be computationally represented as orthonormal matrices of size \(d \times r\), where \(d\) is the dimension of the entire vector space and \(r\) is the dimension of the subspace. Since the senses of a word are largely unrelated to each other, we conjecture that each linearly independent eigenvector that makes up the subspace represents an individual sense of the word. Therefore, a full-rank matrix of rank \(r\) should encode \(r\) different senses of the word.

However, that alone is not enough. We need to quantify the occurrence of each sense in the corpus with “weights”: a particular sense is more prominent if it occurs more frequently in the corpus. In fact, there is a high possibility of one sense of a word completely dominating another owing to its higher frequency of occurrence, so it becomes necessary to normalize the frequency of occurrence of all the senses associated with the word. Mathematically, these weights are the eigenvalues corresponding to the eigenvectors. I propose an efficient model to train words on a Spectrahedron manifold and provide some weak qualitative results showing that polysemous words can benefit from being modelled as points on the Spectrahedron manifold.


The issue of polysemy in contextualised embeddings has been addressed before through sense representation models. One line of work proposes the multi-sense skip-gram model, which builds upon the original skip-gram architecture of Word2Vec by clustering the similar contexts in which a focus word appears and keeping an individual vector representation for each context cluster. These individual representations are the sense representations of that word. The authors also add flexibility by offering parametric and non-parametric versions of their model. Similarly, a non-parametric Bayesian version of the skip-gram model has been proposed that is capable of automatically learning the required number of senses for each word in the corpus.

Another approach takes pre-trained embeddings and “de-conflates” them using an external semantic inventory like WordNet. It also shows that in downstream tasks like topic categorization, multi-sense representations outperform pre-trained single-vector word embeddings. A related line of work does something similar by proposing a topic-modelling-based skip-gram approach for learning multi-prototype word embeddings.

Information-geometric representations of words, especially from a Gaussian perspective, have also gained quite a bit of traction ever since Word2Gauss was proposed. Instead of the traditional point-based representation of Word2Vec and GloVe, Word2Gauss is a density-based distributed embedding model that learns representations in the space of Gaussian distributions. It also introduces an asymmetric measure of similarity between words using the KL-divergence, which has been seen to model hypernymy relations between words, where the distribution for one word (e.g. music) encompasses the distributions for sets of related words (jazz and pop). A related line of work trains words as points on hyperbolic manifolds (the Poincaré disk and Poincaré ball) and provides a mathematical connection between that representation and Word2Gauss, which helps explain the hypernymy-hyponymy relations in Gaussian word embeddings. The Word2Gauss model has been extended further with a Gaussian mixture model that accounts for the multiple senses of a word: each sense is given its own Gaussian distribution, and these are combined as a weighted mixture to obtain the distribution for the word as a whole. Qualitative results show that such mixtures capture semantic information better than skip-gram and Gaussian embeddings on word similarity and lexical entailment. Building on this idea, another model provides a Gaussian mixture embedding for each word in which each component is the Gaussian embedding of a sememe (a unit of semantic information within a word). Its loss function explicitly takes into account synonymy and hypernymy information and models such semantic relations well enough to produce state-of-the-art results in word similarity and lexical entailment. The authors also propose a novel “contextualizer” model which takes Gaussian embeddings as input and produces contextualised Gaussian representations for context-sensitive tasks such as named entity recognition and text classification.


Spectrahedron Manifold

Let \(U \in \mathbb{R}^{d \times r}\) be a full-rank matrix. A word \(w\) is represented by the \(d \times d\) symmetric positive semi-definite matrix \(UU^\top\), whose rank is \(r\). For homogeneity, we want the eigenvalues of \(UU^\top\) to be non-negative and to sum to 1. Since all eigenvalues of a positive semi-definite matrix are already non-negative and the trace of a matrix equals the sum of its eigenvalues, we can add the constraint \(\text{tr}(UU^\top)= ||U||_F^2 = 1\) to the optimization space of \(U\), where \(||.||_F\) is the Frobenius norm operator. Any optimization problem on this matrix space of \(U\) with the above constraint can be turned into an unconstrained optimization problem on the Spectrahedron manifold.
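As a minimal sketch (the sizes are illustrative and the construction is just a random initialization, not a trained embedding), a word factor can be drawn and rescaled to unit Frobenius norm; the resulting \(UU^\top\) then has non-negative eigenvalues that sum to one:

```python
import numpy as np

d, r = 200, 3  # illustrative sizes

# A word's factor U: any full-rank d x r matrix, rescaled so that
# tr(UU^T) = ||U||_F^2 = 1.
U = np.random.randn(d, r)
U /= np.linalg.norm(U, 'fro')

Q = U @ U.T                           # the word's spectrahedron point
evals = np.linalg.eigvalsh(Q)
print(evals.min() >= -1e-10)          # eigenvalues are non-negative
print(np.isclose(evals.sum(), 1.0))   # and sum (trace) to 1
```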

The Spectrahedron manifold has been discussed in detail in the manifold optimization literature. The Spectrahedron manifold, i.e. the set of symmetric positive semi-definite matrices of rank \(r\) with unit trace, is defined as \(\begin{align} \mathcal{S}(d,r) = \{ Q \in \mathbb{R}^{d \times d} \mid a^\top Q a \geq 0 \quad \forall a \in \mathbb{R}^{d}\; , \text{tr}(Q)=1, \nonumber \\ Q = UU^\top\;\text{for}\;U \in \mathbb{R}^{d \times r}\;\text{and}\;\text{rank}(Q) = \text{rank}(U) = r\} \end{align}\) Note that \(r \leq d\). In practice we work with the factor \(U\), which satisfies \(\text{tr}(UU^\top) = ||U||_F^2 = 1\). To perform Riemannian gradient descent on the manifold we need a matrix representation of the tangent space, a closed-form expression for the Riemannian gradient, and a retraction operation. The tangent space \(T_{U}\mathcal{S}(d,r)\) at a factor \(U\) is given by \(\begin{align} T_{U}\mathcal{S}(d,r) = \{ Z \in \mathbb{R}^{d \times r} : \: \text{tr}(U^\top Z) = 0 \;, Z^\top U = U^\top Z \} \end{align}\) A natural choice of Riemannian metric on this tangent space is \(\langle A, B \rangle = \text{tr}(A^\top B)\) where \(A,B \in T_{U}\mathcal{S}(d,r)\). The Riemannian gradient of a function \(F\) at \(U\), \(\text{grad}_{U} F\), is obtained by projecting the Euclidean gradient onto the tangent space: \(\begin{align} \text{grad}_{U} F = \nabla_{U}F - \text{tr}([\nabla_{U}F]^\top U)\,U \end{align}\) where \(\nabla_{U}F\) is the Euclidean gradient of \(F\) at \(U\).

Finally, a natural retraction which is also computationally feasible is \(\begin{align} \text{Ret}_{U}(Z) = \frac{U + Z}{||U + Z||_F} \end{align}\) where \(Z\) is the search direction in the tangent space \(T_{U}\mathcal{S}(d,r)\).
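A minimal sketch of one Riemannian gradient-descent step using the projection and retraction above (the helper names and learning rate are illustrative, not the exact training code):

```python
import numpy as np

def project_to_tangent(U, G):
    """Project a Euclidean gradient G at U using the gradient
    expression in the text: G - tr(G^T U) U."""
    return G - np.trace(G.T @ U) * U

def retract(U, Z):
    """Retraction: step along Z, then renormalise to unit Frobenius norm."""
    W = U + Z
    return W / np.linalg.norm(W, 'fro')

def rgd_step(U, egrad, lr=0.05):
    """One Riemannian gradient-descent update for a word factor U (d x r)."""
    rgrad = project_to_tangent(U, egrad)
    return retract(U, -lr * rgrad)
```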


The Optimization problem

From Spectrahedron to Spherical

The modelling for this problem is built around the skip-gram negative-sampling model proposed in the Word2Vec paper. We have a positive training pair \((u,v)\), where word \(v\) appears in the local context window of word \(u\), and a negative training pair \((u, n)\), where word \(n\) is negatively sampled and not present in the context window of \(u\). Let words \(u, v\) and \(n\) be represented by the positive semi-definite matrices \(\bar{U}=UU^\top\), \(\bar{V}=VV^\top\), \(\bar{N}=NN^\top\), where \(U, V, N\) are all matrices of size \(d \times r\). Optimizing directly over the square matrices \(\bar{U}, \bar{V}, \bar{N}\) becomes extremely expensive when the dimension \(d\) is large. However, since \(U, V, N\) are unit Frobenius-norm matrices and the gradient formula is essentially that of the spherical case, ideas from spherical-geometry optimization can be used instead.

The matrices of size \(d \times r\) are reshaped and stored as vectors of dimension \(dr\). At each step, the vectors are reshaped back into matrices for the calculation of the loss function and the Euclidean gradient. They are then flattened into vectors again, and the Riemannian gradient expression of the spherical geometry is applied; the expression would be identical, only in matrix form, if everything were kept as matrices. Finally, the retraction operation of the sphere is used for the update, as it enforces the unit norm again. Below, we express the loss function in terms of the matrices and derive the expressions for the Euclidean gradients of the loss with respect to these matrices.
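A minimal sketch of this reshape-and-flatten scheme, assuming illustrative sizes and a random stand-in for the Euclidean gradient (the real gradient comes from the loss defined below):

```python
import numpy as np

d, r = 200, 3

def as_matrix(x):
    """View a stored d*r parameter vector as a d x r factor matrix."""
    return x.reshape(d, r)

def sphere_step(x, egrad_vec, lr=0.05):
    """Spherical Riemannian update on the flattened parameters: project the
    Euclidean gradient onto the tangent space of the unit sphere in R^{d*r},
    take a step, then renormalise (the sphere retraction)."""
    rgrad = egrad_vec - np.dot(egrad_vec, x) * x
    x_new = x - lr * rgrad
    return x_new / np.linalg.norm(x_new)

# Illustrative round trip: compute a matrix-form Euclidean gradient,
# flatten it, and apply the spherical update to the stored vector.
x = np.random.randn(d * r)
x /= np.linalg.norm(x)              # unit Frobenius norm of the factor
egrad_mat = np.random.randn(d, r)   # stand-in for dL/dU at as_matrix(x)
x = sphere_step(x, egrad_mat.ravel(), lr=0.05)
```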


Metric and Loss Function

We have the choice of two different metrics for training on the Spectrahedron manifold. The first is a similarity metric derived from the Bures-Wasserstein distance, while the second is the Frobenius norm of the matrix product \(V^\top U\) of the two word matrices:
\(\begin{align} S_{BW}(U, V) = \text{tr}(\text{sqrtm}(U^\top V V^\top U)) \label{eq:ref2} \end{align}\)

\(\begin{align} S_{FR}(U,V) = \sqrt{\text{tr}(U^\top V V^\top U)} \label{eq:ref3} \end{align}\) where \(\text{sqrtm}(\cdot)\) is the matrix square root of a positive semi-definite matrix. We use the Bures-Wasserstein metric as it has interesting geometric implications, discussed later. Note that the Frobenius-norm metric allows faster gradient computation than the BW metric, since the latter requires a matrix square root in its gradient calculation. Both \(S_{BW}\) and \(S_{FR}\) take values in the range \([0,1]\). Since the metric outputs lie in such a limited range of values, a ranking-based loss is more suitable for our model. We therefore use a max-margin ranking objective, similar to that used in Rank-SVM, Wsabie, Word2Gauss and JoSE, which pushes the scores of positive pairs above those of negatives by a margin. So, the loss function for our skip-gram negative-sampling model is

\(\begin{align} L_{m}(U,V,N) =\max (0, m - S_{BW}(U,V) + S_{BW}(U, N)) \end{align}\) where \(m > 0\) is the margin, \(U\) is the focus word matrix, \(V\) is the context word matrix and \(N\) is the negatively sampled word matrix.
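A minimal sketch of the two similarities and the ranking loss (scipy.linalg.sqrtm stands in for \(\text{sqrtm}(\cdot)\); the margin value is illustrative):

```python
import numpy as np
from scipy.linalg import sqrtm

def s_bw(U, V):
    """Bures-Wasserstein similarity: tr(sqrtm(U^T V V^T U))."""
    A = U.T @ V                                   # r x r cross-term
    return float(np.trace(sqrtm(A @ A.T)).real)

def s_fr(U, V):
    """Frobenius similarity: sqrt(tr(U^T V V^T U)) = ||V^T U||_F."""
    return float(np.linalg.norm(V.T @ U, 'fro'))

def margin_loss(U, V, N, m=0.1, sim=s_bw):
    """Max-margin ranking loss for one (focus, context, negative) triple."""
    return max(0.0, m - sim(U, V) + sim(U, N))
```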

Finally, the closed-form Euclidean gradient expressions for \(U, V\) and \(N\) are given by the following: \(\begin{align} \frac{\partial L}{\partial U} = -2 V(V^\top U)\text{dsqrtm}(V^\top U U^\top V, I) + 2 N(N^\top U) \text{dsqrtm}(N^\top U U^\top N, I) \end{align}\)

\(\begin{align} \frac{\partial L}{\partial V} = -2 U(U^\top V)\text{dsqrtm}(V^\top U U^\top V, I) \end{align}\)

\(\begin{align} \frac{\partial L}{\partial N} = 2 U(U^\top N)\text{dsqrtm}(N^\top U U^\top N, I) \end{align}\) where \(\text{dsqrtm}(X,Z)\) is the directional derivative of the matrix square root at \(X\) along the direction \(Z\), and \(I\) is the identity matrix of order \(r\).
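The following sketch transcribes these expressions directly. The directional derivative \(\text{dsqrtm}(X, Z)\) is computed here by solving the Sylvester equation \(\text{sqrtm}(X)\,D + D\,\text{sqrtm}(X) = Z\) with scipy.linalg.solve_sylvester; the gradients are assumed to be applied only when the hinge in the loss is active:

```python
import numpy as np
from scipy.linalg import sqrtm, solve_sylvester

def dsqrtm(X, Z):
    """Directional derivative of the matrix square root at X along Z:
    the solution D of  sqrtm(X) @ D + D @ sqrtm(X) = Z."""
    S = sqrtm(X).real
    return solve_sylvester(S, S, Z)

def euclidean_grads(U, V, N):
    """Euclidean gradients of the margin loss w.r.t. U, V, N
    (to be used only when the hinge is active, i.e. loss > 0)."""
    r = U.shape[1]
    I = np.eye(r)
    D_v = dsqrtm(V.T @ U @ U.T @ V, I)
    D_n = dsqrtm(N.T @ U @ U.T @ N, I)
    dU = -2 * V @ (V.T @ U) @ D_v + 2 * N @ (N.T @ U) @ D_n
    dV = -2 * U @ (U.T @ V) @ D_v
    dN =  2 * U @ (U.T @ N) @ D_n
    return dU, dV, dN
```

Note that for invertible \(X\), \(\text{dsqrtm}(X, I)\) reduces to \(\tfrac{1}{2}X^{-1/2}\).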


Qualitative analysis of Polysemous Words

We perform an eigendecomposition of the Spectrahedron matrix that represents a polysemous word (e.g. apple). Each resulting eigenvector represents an individual sense of the word, with the corresponding eigenvalue representing the weight of that sense in the corpus. For practical purposes, we extract only the top \(r\) eigenvectors, since the remaining \(d-r\) eigenvectors of the \(d \times d\) matrix have eigenvalues close to 0. Each word now has \(r\) vectors to represent its \(r\) senses. We run a nearest-neighbour word search on these \(r \times V\) vectors (\(V\) = total number of words in the vocabulary) with cosine similarity as the measure of similarity between vectors. The tables below show how each eigenvector of a word captures one specific sense of it, and we observe some interesting trends. Such results have notable linguistic implications: for example, one could use this technique to track how word contexts have changed over centuries by training on historical texts from different periods of time.
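A minimal sketch of this sense extraction and nearest-neighbour search (sense_bank and vocab are illustrative placeholders for the stacked sense vectors and their word labels):

```python
import numpy as np

def sense_vectors(U, r):
    """Top-r eigenpairs of UU^T. Returns (eigenvalues, eigenvectors),
    sorted by decreasing eigenvalue; each row is one sense vector."""
    w, Q = np.linalg.eigh(U @ U.T)        # ascending eigenvalues
    idx = np.argsort(w)[::-1][:r]
    return w[idx], Q[:, idx].T

def nearest_words(sense, sense_bank, vocab, k=10):
    """k nearest senses (by cosine similarity) to a given sense vector.
    `sense_bank` is an (n_senses x d) array, `vocab` the matching labels."""
    sims = sense_bank @ sense / (
        np.linalg.norm(sense_bank, axis=1) * np.linalg.norm(sense) + 1e-12)
    top = np.argsort(-sims)[:k]
    return [(vocab[i], float(sims[i])) for i in top]
```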

The following are the results for \(d=200, r=3\).

Word: apple
First eigenvector: macintosh, amiga, mac, os, pc, ibm, microcomputer, compatibles, hardware, gui, commodore, clones, software, iic, workstations
Second eigenvector: pomegranate, palm, teas, berries, flowers, peripherals, tart, pears, grapevine, herbs, roasted, olive, barley, mango
Third eigenvector: dark, oil, cup, cultivar, much, widely, consists, watershed, mediterranean, stages, solution, lemon, yellow, leaves, fruits

Word: DC
First eigenvector: wildstorm, watchmen, superheroes, supervillain, miniseries, fantagraphics, superpowered, supergirl, nightwing, smallville
Second eigenvector: amplifier, signals, size, bandwith, signal, frequency, transistor, connectors, impedance, transistors, analog, setup, device
Third eigenvector: film, makeup, external, returns, ninja, billion, million, sandman, volume, director, marvel, detective, clerks, villains

Word: soul
First eigenvector: immortality, eternal, god, mind, divine, atman, spirit, afterlife, consciousness, love, spiritual, heaven, souls, grace, everlasting, contemplation, goodness, essence, intellect
Second eigenvector: soul, judas, synthpop, muggs, sampling, hit, rappers, righteous, adult, cure, folk, djs, acts, pianists, arrangements, electric, sweat, alternative, radiohead
Third eigenvector: unconscious, obelix, despised, window, gets, proofs, triune, gnomes, runs, devil, table, balcony, tubulin, ousia, adventures

Word: bill
First eigenvector: clinton, johnson, frist, carter, jim, tony, bob, mike, joe, president, act, gerry, larry, steve, dave
Second eigenvector: british, reality, uk, candian, house, legislation, played, owner, entitled, andy, prime, members, enacted, king, last, japanese
Third eigenvector: bit, war, album, feet, science, vote, until, broadcast, greatest, announced, won, meters, saw, completed, age

We observe, for instance in the case of apple, that the first eigenvector relates to words that appear in the context of Apple as a company, while the second relates to apple in the context of fruit. Similar trends can be observed for the other words as well. There can also be some noise, as demonstrated by the third eigenvector for certain words.

The following are the results for \(d=200, r=5\).

Word: apple
First eigenvector: macintosh, mac, os, pc, ipod, ibm, desktop, iigs, windows, imac, gui, iic, pcs, intel, powerpc, hardware, commdore
Second eigenvector: honey, lotus, leaves, fruit, sweet, juice, machine, drink, fermented, berries, beans, grape, apples, leaf, water, bread
Third eigenvector: performed, popularity, writing, production, several, roman, closed, teaching, latin, moving, program, hands, moving, discussion, terms
Fourth eigenvector: king, mystery, chronicles, news, billboard, caustic, singles, revenge, curtain, river, seer, trucker, homer, lasted, genre, remixes, knuckles, album, sleeve, beastie
Fifth eigenvector: events, plate, gravity, market, inertia, exchange, grounds, falls, bodies, chronicle, local, rotational, causality, acceleration, newton, meaning, gravitation, providence, inertial, rotation, refraction, water

Word: dc
First eigenvector: marvel, superman, washington, batman, kirby, supergirl, lois, luthor, smallville, superboy, clark, waid, superpowered, romita, miniseries, schengen, lex
Second eigenvector: locations, washington, villains, superhero, broadway, prince, strip, space, science, series, conventions, bird, theater, has
Third eigenvector: voltage, amplifier, circuits, wired, resistor, known, single, impedance, voltages, circuit, transistors, resistors, device
Fourth eigenvector: persons, land, increase, sq, months, about, statute, income, amounted, degrees, than, km, species, hours, hundred, kiss, passengers, election, typically, weight
Fifth eigenvector: group, website, db, engine, strategy, replace, acronym, years, controversy, islands, future, europe, mitsubishi, official, aims, footnotes, bbci

Word: sin
First eigenvector: sins, god, eternal, atonement, venial, damnation, repentance, satan, heaven, everlasting, sinner, redemption, righteousness, depravity, mortal, apostasy
Second eigenvector: cos, case, cdots, frac, cdot, greek, qquad, y, let, pi, quad, ldots, exp, dt, theta, begin, exp, sqrt, mathbf
Third eigenvector: dimension, matrix, dog, space, roles, little, emotions, special, stone, heaven, god, child, good, features, show
Fourth eigenvector: official, duplicated, value, freely, product, heritage, dominated, devil, decline, stress, possibilites
Fifth eigenvector: writing, surpassing, metropolitan, mathematician, wolf, ring, needless, detroiters, burrito, circling, mayor, education, skyline, vestibule

Similar results are observed for \(d=200, r=5\). Notice the difference for the word apple compared with the previous case: the fifth eigenvector represents apple in a more limited and specific context, such as Newton discovering gravity through the falling of an apple.


A Gaussian Embedding Perspective

The Bures-Wasserstein similarity metric allows each word embedding to be interpreted as a zero-centred multivariate Gaussian distribution. For \(r \leq d\), the symmetric positive semi-definite matrix of a word embedding plays the role of the positive semi-definite covariance matrix, i.e. the word with factor \(U\) is associated with \(\mathcal{N}(0, UU^\top)\). In the case of \(r=d\), the covariance matrix is positive definite (full rank). We know that the covariance matrix defines both the spread (variance) and the orientation (covariance) of any data. The eigenvector with the largest eigenvalue is the direction along which the data has the maximum variance, and the variance decreases for successive eigenvalues. As seen in the previous section, the eigenvectors are associated with the senses of the word. This suggests that the most dominant sense of the word has the maximum variance and the least dominant sense has the least variance. If each occurrence of the word in the corpus is one “data sample”, we are essentially modelling the total occurrences of that word in the corpus as the total data space of that word, where each sense moves towards its respective cluster. Such behaviour is actively enforced as part of the modelling in prior multi-sense work, but due to the geometry of the Spectrahedron it occurs more naturally for us, albeit a little weakly.
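As a quick numerical sanity check (a sketch, not part of the training pipeline), the standard closed form for the squared 2-Wasserstein distance between zero-mean Gaussians, combined with the unit-trace constraint, gives \(W_2^2(\mathcal{N}(0,UU^\top), \mathcal{N}(0,VV^\top)) = 2 - 2\,S_{BW}(U,V)\):

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(M)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

d, r = 20, 3
rng = np.random.default_rng(0)
U = rng.standard_normal((d, r)); U /= np.linalg.norm(U, 'fro')
V = rng.standard_normal((d, r)); V /= np.linalg.norm(V, 'fro')

# Bures-Wasserstein similarity used in the loss
s_bw = np.trace(psd_sqrt(U.T @ V @ V.T @ U))

# Squared 2-Wasserstein distance between N(0, UU^T) and N(0, VV^T)
S1, S2 = U @ U.T, V @ V.T
R1 = psd_sqrt(S1)
w2_sq = np.trace(S1) + np.trace(S2) - 2 * np.trace(psd_sqrt(R1 @ S2 @ R1))

# With unit traces, W_2^2 = 2 - 2 * S_BW
print(np.isclose(w2_sq, 2.0 - 2.0 * s_bw, atol=1e-8))   # expected: True
```

This makes precise the sense in which maximizing \(S_{BW}\) between a focus word and its context word pulls the corresponding zero-centred Gaussians together in Wasserstein distance.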