Multi-sense information of a polysemous word is well conveyed through the geometry of Spectrahedron Manifolds
Unsupervised global word embedding models like GloVe and Word2Vec map each word to a single point in a vector space. As a result, all possible meanings of a polysemous word present in the corpus are conflated into a single representation. This phenomenon is termed meaning conflation deficiency, and it causes unwanted semantic modelling behaviour such as pulling two seemingly unrelated words close together in the latent space. Take the two words “rat” and “screen”: they are pulled towards each other in the embedding space because both are synonymous with the word “mouse” under two different senses. If a distinction is not made between the senses, the accuracy of semantic modelling suffers. One popular solution to this problem is sense representation, i.e. each individual sense of a word present in a corpus is encoded with an independent vector embedding.

My goal is to find a well-thought-out word representation model that tackles this multi-sense problem by having a single mode of representation for each polysemous word, from which a computationally efficient function produces a single vector representation for each sense. One possible way to do that would be to encode words as points on a Grassmannian manifold, which are essentially \(r\)-dimensional subspaces for some \(r \in \mathbb{N}\). Points on a Grassmannian manifold are subspaces that can be computationally represented as orthonormal matrices of size \(d \times r\), where \(d\) is the dimension of the entire vector space and \(r\) is the dimension of the subspace. Since the senses of a word are largely unrelated to each other, we conjecture that each non-zero, linearly independent eigenvector that makes up the subspace represents an individual sense of the word. Therefore, a full-rank matrix of rank \(r\) should encode \(r\) different senses of the word.

However, that alone won’t be enough. We need to quantify the occurrence of each sense in the corpus with “weights”: a particular sense is more prominent if it occurs more frequently in the corpus. In fact, there is a high possibility of one sense of a word completely dominating another owing to its higher occurrence in the corpus, so it becomes necessary to normalize the frequency of occurrence of all the senses associated with the word. Mathematically, these weights are the eigenvalues corresponding to the eigenvectors. I propose an efficient model to train words on a Spectrahedron manifold and provide some weak qualitative results showing that polysemous words could benefit from being modelled as points on the Spectrahedron manifold.
The issue of polysemy in contextualised embeddings has been addressed before through sense representation models. One line of work extends Word2Vec by clustering together the similar contexts in which a focus word appears and learning an individual vector representation for each context cluster. These individual representations are the sense representations of that word. The authors also add flexibility by providing a parametric and a non-parametric version of their model. Similarly, another approach takes pre-trained embeddings and “de-conflates” them using an external semantic inventory like WordNet. They also show that in downstream tasks like topic categorization, multi-sense representations outperform pre-trained single-vector word embeddings.
Information geometric representation of words, especially from a Gaussian perspective, has also gained quite a bit of traction ever since Word2Gauss. Instead of the traditional position-based representation of Word2Vec and GloVe, Word2Gauss learns density-based distributed embeddings, presenting a method for learning representations in the space of Gaussian distributions. It also introduces an asymmetric measure of similarity between words using KL-divergence, which has been seen to model hypernymy relations between words, where the distribution for one word (e.g. music) encompasses the distributions for sets of related words (jazz and pop). Word2Gauss has since been refined by follow-up work that helped explain the hypernymy-hyponymy relations in Gaussian word embeddings better, and the model has been extended even further in later variants.
Let \(U \in \mathbb{R}^{d \times r}\) be a full-rank matrix. A word \(w\) is represented by the \(d \times d\) symmetric positive semi-definite matrix \(UU^T\), whose rank is \(r\). For homogeneity, we want the eigenvalues of the matrix \(UU^T\) to be non-negative and sum to 1. Given that all the eigenvalues of a positive semi-definite matrix are already non-negative and the trace of a matrix is equal to the sum of its eigenvalues, we can add the constraint \(\text{tr}(UU^T)= ||U||_F^2 = 1\) to the optimization space of \(U\), where \(||.||_F\) is the Frobenius norm operator. Any optimization problem on this matrix space of \(U\) with the above constraint can be turned into an unconstrained optimization problem on the Spectrahedron manifold.
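As a quick illustration (a minimal NumPy sketch; the dimensions \(d=200\), \(r=3\) simply match the values used in the experiments later), the unit-trace constraint is enforced by giving the factor \(U\) unit Frobenius norm:

```python
import numpy as np

d, r = 200, 3
rng = np.random.default_rng(0)

U = rng.standard_normal((d, r))
U /= np.linalg.norm(U)                  # ||U||_F = 1  =>  tr(U U^T) = 1

Q = U @ U.T                             # the word's Spectrahedron point
eigvals = np.linalg.eigvalsh(Q)

assert np.isclose(np.trace(Q), 1.0)     # unit trace
assert eigvals.min() > -1e-10           # PSD: all eigenvalues non-negative
assert np.isclose(eigvals.sum(), 1.0)   # sense "weights" sum to 1
```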
The Spectrahedron manifold has been discussed in detail in the Riemannian optimization literature. It is the set of symmetric positive semi-definite matrices of rank \(r\) with unit trace, mathematically defined as

\(\begin{align} \mathcal{S}(d,r) = \{ Q \in \mathbb{R}^{d \times d} \;|\; a^\top Q a \geq 0 \quad \forall a \in \mathbb{R}^d \; , \text{tr}(Q)=1, \nonumber \\ Q = UU^\top\;\text{for}\;U \in \mathbb{R}^{d \times r}\;\text{and}\;\text{rank}(Q) = \text{rank}(U) = r\} \end{align}\)

Note that \(r \leq d\) for all intents and purposes. We now need a matrix representation of the tangent space, a closed-form expression for the Riemannian gradient and a retraction operation in order to perform Riemannian gradient descent on the manifold. In practice, a point \(Q = UU^\top \in \mathcal{S}(d,r)\) is stored and optimized through its factor \(U \in \mathbb{R}^{d \times r}\) with \(||U||_F = 1\), and the tangent space at \(U\) is represented as

\(\begin{align} T_{U}\mathcal{S}(d,r) = \{ Z \in \mathbb{R}^{d \times r} : \; \text{tr}(Z^\top U) = 0 \;, \; U^\top Z = Z^\top U \} \end{align}\)

A natural choice of Riemannian metric on this tangent space is \(\text{tr}(A^\top B)\) for \(A,B \in T_{U}\mathcal{S}(d,r)\). The Riemannian gradient of a function \(F\) at \(U\), \(\text{grad}_{U} F\), is calculated as follows:

\(\begin{align} \text{grad}_{U} F = \nabla_{U}F - \text{tr}([\nabla_{U}F]^\top U)\,U \end{align}\)

where \(\nabla_{U}F\) is the Euclidean gradient of \(F\) at \(U\). Finally, a natural retraction of choice, which is also computationally cheap, is

\(\begin{align} \text{Ret}_{U}(Z) = \frac{U + Z}{||U + Z||_F} \end{align}\)

where \(Z\) is the search direction in the tangent space \(T_{U}\mathcal{S}(d,r)\).
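A minimal NumPy sketch of these three ingredients, operating directly on the factor \(U\) (the step size is an arbitrary illustrative value):

```python
import numpy as np

def riemannian_grad(U, egrad):
    """Project the Euclidean gradient onto the tangent space at U:
    grad F = egrad - tr(egrad^T U) U."""
    return egrad - np.trace(egrad.T @ U) * U

def retract(U, Z):
    """Step along the tangent direction Z and renormalise to unit Frobenius norm."""
    W = U + Z
    return W / np.linalg.norm(W)

def rgd_step(U, egrad, lr=0.05):
    """One Riemannian gradient-descent step on the Spectrahedron factor U."""
    return retract(U, -lr * riemannian_grad(U, egrad))
```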
The modelling for this problem is built around the skip-gram negative sampling idea proposed in the Word2Vec paper. We have a positive training pair \((u,v)\), where word \(v\) appears in the local context window of word \(u\), and a negative training pair \((u, n)\), where word \(n\) is negatively sampled and not present in the context window of \(u\). Let words \(u, v\) and \(n\) be represented by the positive semi-definite matrices \(\bar{U}=UU^T\), \(\bar{V}=VV^T\), \(\bar{N}=NN^T\), where \(U, V, N\) are all \(d \times r\) matrices. This would mean optimizing with computations on the square matrices \(\bar{U}, \bar{V}, \bar{N}\), which becomes extremely expensive when the dimension \(d\) is large. However, since \(U, V, N\) are unit-norm matrices and the gradient formula is very similar to the spherical case, ideas from spherical-geometry optimization can be used. The matrices of size \(d \times r\) are reshaped and stored as vectors of dimension \(dr\). At each step, the vectors in question are reshaped back into matrices for the calculation of the loss function and the Euclidean gradient. They are then reshaped into vectors again, and the Riemannian gradient expression of the spherical geometry is used, since the gradient expression would have been the same, only in matrix form, had they been kept as matrices. Finally, the retraction operation for the sphere is used for the update, as it enforces the unit norm again. We express the loss function in terms of the matrices and derive the expressions for the Euclidean gradients of the loss with respect to these matrices.
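Assuming such a Euclidean-gradient routine (here a placeholder `egrad_fn`, derived below), one update of the focus word under this reshape-and-project scheme might look like the following sketch; the learning rate is an arbitrary illustrative value:

```python
import numpy as np

def train_step(u_vec, v_vec, n_vec, d, r, egrad_fn, lr=0.05):
    """One update of the focus word's flat parameter vector of length d*r.

    egrad_fn(U, V, N) must return the Euclidean gradient of the loss with
    respect to the d x r matrix U."""
    U, V, N = (x.reshape(d, r) for x in (u_vec, v_vec, n_vec))
    egrad = egrad_fn(U, V, N).reshape(-1)          # matrix gradient, flattened

    # Spherical Riemannian gradient and retraction on the flattened vector.
    rgrad = egrad - (egrad @ u_vec) * u_vec
    new_u = u_vec - lr * rgrad
    return new_u / np.linalg.norm(new_u)
```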
We have the choice of two different similarity metrics to train our word representations on the Spectrahedron manifold. The first is derived from the Bures-Wasserstein distance, while the second is based on the Frobenius norm of \(U^\top V\), the product of the two word factors.

\(\begin{align} S_{BW}(U, V) = \text{tr}(\text{sqrtm}(U^\top V V^\top U)) \label{eq:ref2} \end{align}\)

\(\begin{align} S_{FR}(U,V) = \sqrt{\text{tr}(U^\top V V^\top U)} \label{eq:ref3} \end{align}\)

where \(\text{sqrtm}(.)\) is the matrix square root of a positive semi-definite matrix. We use the Bures-Wasserstein metric as it has interesting geometric implications that we discuss later. Note that the Frobenius-norm metric allows faster gradient computation than the BW metric, since the latter requires a matrix square root operation in its gradient calculation. Both \(S_{BW}\) and \(S_{FR}\) take values in the range \([0,1]\). Since the metric outputs lie in such a limited range, a ranking-based loss is more suitable for our model. We use a max-margin ranking objective, similar to those used in Rank-SVM, Wsabie, Word2Gauss and JoSE:

\(\begin{align} L_{m}(U,V,N) =\max (0, m - S_{BW}(U,V) + S_{BW}(U, N)) \end{align}\)

where \(m > 0\) is the margin, \(U\) is the focus word matrix, \(V\) is the context word matrix and \(N\) is the negatively sampled word matrix.
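A minimal sketch of both similarity metrics and the ranking loss (the margin value is an illustrative choice, not one prescribed above):

```python
import numpy as np
from scipy.linalg import sqrtm

def s_bw(U, V):
    """Bures-Wasserstein similarity: tr(sqrtm(U^T V V^T U))."""
    A = U.T @ V                                    # r x r
    return float(np.trace(sqrtm(A @ A.T)).real)

def s_fr(U, V):
    """Frobenius-norm similarity: sqrt(tr(U^T V V^T U)) = ||U^T V||_F."""
    return float(np.linalg.norm(U.T @ V))

def margin_loss(U, V, N, m=0.1):
    """Max-margin ranking loss over a (focus, context, negative) triple."""
    return max(0.0, m - s_bw(U, V) + s_bw(U, N))
```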
Finally, the closed-form Euclidean gradient expressions with respect to \(U, V\) and \(N\), valid whenever the margin is violated (i.e. \(L_m > 0\); otherwise all gradients are zero), are given by the following:

\(\begin{align} \frac{\partial L}{\partial U} = -2 V(V^\top U)\,\text{dsqrtm}(U^\top V V^\top U, I) + 2 N(N^\top U)\, \text{dsqrtm}(U^\top N N^\top U, I) \end{align}\)

\(\begin{align} \frac{\partial L}{\partial V} = -2 U(U^\top V)\,\text{dsqrtm}(V^\top U U^\top V, I) \end{align}\)

\(\begin{align} \frac{\partial L}{\partial N} = 2 U(U^\top N)\,\text{dsqrtm}(N^\top U U^\top N, I) \end{align}\)

where \(\text{dsqrtm}(X,Z)\) is the derivative of the matrix square root of \(X\) along the matrix \(Z\) and \(I\) is the identity matrix of order \(r\).
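A small sketch of these gradients, using SciPy's `sqrtm` and `solve_sylvester` for the Fréchet derivative (the derivative \(W\) of the matrix square root satisfies the Sylvester equation \(SW + WS = Z\) with \(S = \text{sqrtm}(X)\)); the function and argument names are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm, solve_sylvester

def dsqrtm(X, Z):
    """Frechet derivative of the matrix square root of X along Z:
    the solution W of  S W + W S = Z  with  S = sqrtm(X)."""
    S = sqrtm(X).real
    return solve_sylvester(S, S, Z)

def euclid_grad_U(U, V, N):
    """dL/dU for the max-margin loss; apply only when the margin is
    violated (L_m > 0), otherwise the gradient is zero."""
    I = np.eye(U.shape[1])
    term_v = -2.0 * V @ (V.T @ U) @ dsqrtm(U.T @ V @ V.T @ U, I)
    term_n = 2.0 * N @ (N.T @ U) @ dsqrtm(U.T @ N @ N.T @ U, I)
    return term_v + term_n
```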
We perform an eigendecomposition of the Spectrahedron matrix that represents a polysemous word (e.g. apple). Each of the resulting eigenvectors represents an individual sense of the word apple, with the corresponding eigenvalue representing the weight of that sense in the corpus. For practical purposes, we extract only the top \(r\) eigenvectors, since the remaining \(d-r\) eigenvectors of the \(d \times d\) matrix have eigenvalues close to 0. Each word therefore now has \(r\) vectors to represent its \(r\) senses. We run a nearest-neighbour word search over these \(r \times V\) sense vectors (\(V\) = total number of words in the vocabulary) with cosine similarity as the measure of similarity between vectors. The tables below show how each eigenvector of a word captures one specific sense of it. We observe some rather interesting trends. The linguistic implications of such results are considerable, as one could use this technique to see how word contexts have changed over centuries by training on historical texts from different periods.
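The sketch below (with a hypothetical `word_to_senses` dictionary mapping each word to its \(r\) extracted sense vectors) illustrates this extraction and nearest-neighbour search step:

```python
import numpy as np

def sense_vectors(U, r):
    """Top-r eigenvectors/eigenvalues of Q = U U^T: one vector per sense,
    one eigenvalue per sense weight."""
    eigvals, eigvecs = np.linalg.eigh(U @ U.T)     # ascending order
    idx = np.argsort(eigvals)[::-1][:r]
    return eigvals[idx], eigvecs[:, idx].T         # shapes (r,), (r, d)

def nearest_words(query, word_to_senses, topk=10):
    """Cosine-similarity nearest neighbours over all r x V sense vectors."""
    scores = []
    for word, senses in word_to_senses.items():
        for vec in senses:
            cos = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query) + 1e-12)
            scores.append((cos, word))
    return sorted(scores, reverse=True)[:topk]
```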
The following are the results for \(d=200, r=3\).
Word | First Eigenvector | Second Eigenvector | Third Eigenvector |
---|---|---|---|
apple | macintosh, amiga, mac, os, pc, ibm, microcomputer, compatibles, hardware, gui, commodore, clones, software, iic, workstations | pomegranate, palm, teas, berries, flowers, peripherals, tart, pears, grapevine, herbs, roasted, olive, barley, mango | dark, oil, cup, cultivar, much, widely, consists, watershed, mediterranean, stages, solution, lemon, yellow, leaves, fruits |
DC | wildstorm, watchmen, superheroes, supervillain, miniseries, fantagraphics, superpowered, supergirl, nightwing, smallville | amplifier, signals, size, bandwith, signal, frequency, transistor, connectors, impedance, transistors, analog, setup, device | film, makeup, external, returns, ninja, billion, million, sandman, volume, director, marvel, detective, clerks, villains |
soul | immortality, eternal, god, mind, divine, atman, spirit, afterlife, consciousness, love, spiritual, heaven, souls, grace, everlasting, contemplation, goodness, essence, intellect | soul, judas, synthpop, muggs, sampling, hit, rappers, righteous, adult, cure, folk, djs, acts, pianists, arrangements, electric, sweat, alternative, radiohead | unconscious, obelix, despised, window, gets, proofs, triune, gnomes, runs, devil, table, balcony, tubulin, ousia, adventures |
bill | clinton, johnson, frist, carter, jim, tony, bob, mike, joe, president, act, gerry, larry, steve, dave | british, reality, uk, candian, house, legislation, played, owner, entitled, andy, prime, members, enacted, king, last, japanese | bit, war, album, feet, science, vote, until, broadcast, greatest, announced, won, meters, saw, completed, age |
We observe, for instance in the case of apple, that the first eigenvector relates to words that appear in the context of apple as a company, while the second relates to apple in the context of fruit. Such trends can be observed for other words as well. There can also be noise at times, as demonstrated by the third eigenvector for certain words.
Word | First Eigenvector | Second Eigenvector | Third Eigenvector | Fourth Eigenvector | Fifth Eigenvector |
---|---|---|---|---|---|
apple | macintosh, mac, os, pc, ipod, ibm, desktop, iigs, windows, imac, gui, iic, pcs, intel, powerpc, hardware, commdore | honey, lotus, leaves, fruit, sweet, juice, machine, drink, fermented, berries, beans, grape, apples, leaf, water, bread | performed, popularity, writing, production, several, roman, closed, teaching, latin, moving. program, hands, moving, discussion, terms | king, mystery, chronicles, news, billboard, caustic, singles, revenge, curtain, river, seer, trucker, homer, lasted, genre, remixes, knuckles, album, sleeve, beastie | events, plate, gravity, market, inertia, exchange, grounds, falls, bodies, chronicle, local, rotational, causality, acceleration, newton, meaning, gravitation, providence, inertial, rotation, refraction, water |
dc | marvel, superman, washington, batman, kirby, supergirl, lois, luthor, smallville, superboy, clark, waid, superpowered, romita, miniseries, schengen, lex | locations, washington, villains, superhero, broadway, prince, strip, space, science, series, conventions, bird, theater, has | voltage, amplifier, circuits, wired, resistor, known, single, impedance, voltages, circuit, transistors, resistors, device | persons, land, increase, sq, months, about, statute, income, amounted, degrees, than, km, species, hours, hundred, kiss, passengers, election, typically, weight | group, website, db, engine, strategy, replace, acronym, years, controversy, islands, future, europe, mitsubishi, official, aims, footnotes, bbci |
sin | sins, god, eternal atonement, venial, damnation, repentance, satan, heaven, everlasting, sinner, redemption, righteousness, depravity, mortal, apostasy | cos, case, cdots, frac, cdot, greek, qquad, y, let, pi, quad, ldots, exp, dt, theta, begin, exp. sqrt, mathbf | dimension, matrix, dog, space, roles, little, emotions, special, stone, heaven, god, child, good, features, show | official, duplicated, value, freely, product, heritage, dominated, devil, decline, stress, possibilites | writing, surpassing, metropolitan, mathematician, wolf, ring, needless, detroiters, burrito, circling, mayor, education, skyline, vestibule |
Similar results are observed for \(d=200, r=5\). Notice the difference for the word apple compared to the previous case. The fifth eigenvector represents apple in a more limited and specific context, like Newton discovering gravity through the falling of an apple.
The Bures-Wasserstein similarity metric allows each word embedding to be interpreted as a zero-centred multivariate Gaussian distribution with covariance matrix \(UU^\top\).
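Concretely, using the standard closed form of the 2-Wasserstein distance between zero-mean Gaussians (stated here for context rather than derived), the unit-trace constraint makes the squared distance a simple monotone function of \(S_{BW}\):

\(\begin{align} W_2^2\big(\mathcal{N}(0, UU^\top), \mathcal{N}(0, VV^\top)\big) = \text{tr}(UU^\top) + \text{tr}(VV^\top) - 2\,S_{BW}(U,V) = 2\,(1 - S_{BW}(U,V)) \end{align}\)

The cross term equals \(S_{BW}(U,V)\) because the non-zero eigenvalues of \(UU^\top VV^\top\) coincide with those of \(U^\top V V^\top U\), so maximising \(S_{BW}\) for positive pairs is equivalent to pulling the corresponding Gaussians together in Wasserstein distance.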