Does Word2Vec Dream of Semantic Sheep?

I played with Google’s magical word2vec neural network some time ago. I found it interesting but I had no immediate use for it, so I filed it in the ‘must remember to check this out further’ section of my disorganised brain. More recently I found myself wrestling with topic modelling resulting in a near terminal headlock. I have a data set of very short (say, 2 – 50 word) documents that I want to group thematically. LDA and it’s various cousins were struggling with this task. There are few of reasons for this. Firstly, even though LDA, LSI etc. perform a sort of vector dimensionality reduction, this only works to a point. A three word sentence contains barely enough information to derive a single topic from, let alone a distribution, and consequently my topic distributions were too sparse to do much with. Secondly, and exacerbating the first point, the topic space across the corpus is pretty limited and somewhat homogeneous. Thirdly, and exacerbating the other two points, there is marked lack of variety in the language used across the corpus, and from one topic to the next. Humans were struggling to distinguish one theme from another, so what chance did a computer have? I was just about to give it up as a lost cause when I remembered that word2vec has some similarity measures, and a vague recollection of someone suggesting it could be used for topic modelling. My basic theory here is that if I can compare sentences for similarity I should be able to group them via that similarity (I’m using a graph clustering model to do this). So as a last ditch effort I ran word2vec over my corpus and started playing around to see if it could make sense of my data. The results were phenomenal! The similarity graph created from a simple, untuned word2vec model outperformed the other models at unsupervised classification 10 fold at least – where before I saw only loose semantic groupings with many mis-grouped items, I now saw empirically cohesive and accurate groups. As pleased as I was by this turn of events, I didn’t understand why Google’s simple neural network worked at all for my purposes, let alone outperformed everything else. So I bathed myself in warm, welcoming, buoyant sea of word2vec’s vector space. As I did so, I started to appreciate word2vec’s spooky action.

It’s not my intention to repeat what’s already been said about word2vec but merely to state my own findings. I’ll start with the crux of my main confusion about the model. I’m using to to interpret and group customer feedback. My understanding of word2vec is that it groups words that are semantically similar, or at least proximal. You can explore the most similar words to any given word or words easily in word2vec. My model was trained on a million or so items of customer feedback from a website survey. I should be swift here to mention that they have an excellent site that works well for the vast majority of customers, but like every website, doesn’t preform well for everyone all the time. So two words that occur in close proximity a fair amount are “site” and “slow” (it’s also worth pointing out that “site” actually co-exists with “fast” more often in the same corpus, however, we’re looking for problems to solve, not praise to lap up). However, when I looked at the top 50 words in closest proximity with “site”, “slow” was nowhere to be found. I get loads of similes of the the word “site” (e.g. website). And and all the words most closely related to “slow” are other words that loosely mean slow (e.g. sluggish). It was obvious to me at the point what word2vec was actually doing in this instance, and that my expectation was not aligned with how it works, but this lead me to a bit of an epiphany – holy shit, word2vec understands actual semantic relationships between words without any formal teaching; purely by inference! To put it another way, it groups together words it rarely if ever sees together, even in the same document. That’s pretty clever for a simple neural network with only a single hidden layer. I also picked out similarities in misspelled words. This is deceptively helpful since one of my core frustrations with the data sets (at least for the purposes of supervised learning and approach I’m also using) was that people don’t tend to put much effort into spell checking what they enter into an online survey. That’s some pretty spooky action!

So, on the surface at least, it seemed that the reason my similarity scores were so on the mark was that word2vec was cleverly able to pair of similar words in a sentence and thus was able to create robust similarity scores. However, this explanation is a bit Newton to Einstein – a good explanation, but not the whole story. Word2vec’s spooky action is a lot more abstract and, dare I say it, mysterious. This deserves a little more probing. The model that word2vec actually produces is actually the single hidden neural network layer previously referenced. It consists of a n x m vector space where n is the number of words in your corpus (optionally pruned to get rid of infrequent or too frequent cruft) and m is the an arbitrary number of floating point dimensions usually some where between 100 and 700. These dimensions are somewhat intractable from a human perspective. They are quanta of a continuous abstract vector space that maps a territory of words in a sort of semantic relief map. Words of similar meaning exist in the same general area of the map. Tribes of words exist in a single area just like tribes of people do in the real world. This extends past similes to words of the same type however. I ingested the prebuilt vector space on the word2vec home page (the 300 dimension Google news one) and did some exploring. I discovered various bits of spooky action:

Names of musicians (Alice Copper, Ozzy Osbourne, and David Bowie) coexist together with bands (Metallica, Motorhead) all of which bare no relation to, say “cheese”
Names of US presidents occupy the same general space, with small offsets that seem to suggest political affiliations (more research needed here), as do scientists with a little evidence that they cluster with their respective disciplines
Parts of the brain (and indeed neurotransmitters) all occupy the same space, and a cursory appraisal suggests that closer proximity exists form those parts closer to each other (hypothalamus, nucleus accumbens and midbrain all cluster very closely). One assumes that this is the same for all anatomical parts
It has no sense of opposites – fast and slow cluster very closely together, black and white even more so
Words cluster together when they are notionally similar rather than the same type of word, so “black” and “blackest” cluster close together. There seems no definable continuous space for, say, nouns, proper nouns, adjectives etc.

There’s a sense that it clusters words when they seem interchangeable to a greater lesser degree. The mathematical offset described with the famous “king – man + woman = queen” example seems to reinforce this. The spacial significance comes further to light when you consider the two main ways to interrogate the vector space. The standard way (al la Gensim and others) is to scan the entire vector space for vectors with the closest cosine similarity to the vector of the target word (which is usually something in the same proximity). When you’re dealing with multiple words (e.g. n-grams, sentences or even whole documents) the approach is simply to find the exact vector for each word, then take a column level average across those words to create a new vector which then goes through the same cosine similarity malarchy. When comparing one sentence/document to another, we take the same average and get the cosine between the two. The second approach looks to take an actual numerical offset from one words, or collection of words, to another as per Kusner et al. In combination these two approaches make the topological aspect of the model more salient still suggesting that the word embeddings exist in some intangible semantic space-time of numerous dimensions and geometry which, in a very real sense, is exactly what it is. A simile generator may be a convenient way to describe it, but the reality is much more complex and elegant. So back to my original idea that word2vec was pairing off words in a sentence. This is not the case at all. Using the cosine similarity approach as I was, what was actually happening is that I was generating vector point constructed from an average of a collection words, then go find other word or words that are proximally (spatially) close. Words are never compared, we actually just go find the tribe that has the most DNA in common with my sentence, as it were.

So how does it figure this stuff out? Well, some much better mathematicians than me already concluded that they “don’t really know”, so what hope have I got? I can just try and make some sense of what I observe. Much has been made of the difference between CBOW and Skip-gram as way to evaluate the text, however there seems little to suggests that either contributes to the overall spooky action, rather than build upon the the mysterious workings of a simple neural network. The information is in there, in all the written text, and nothing is inferred that isn’t visibly available – there’s no extrapolation going on here, no logic. Word2vec doesn’t read it or understand the text, it just picks up patterns. Interpretation as an adjunct to semantic awareness is a job for a future, much more sophisticated algorithm or model or AI. The best analogy I can think of for word2vec is the very mechanism that neural networks try and emulate – the human brain. In particular long term memory. It’s easy to imagine that, as information flows in through our senses, it is brokered into similar abstract representations in the cerebral cortex, then either reinforced (learned) or forgotten. We know that the best way to remember something is to relate it to something to something we already have a good sense of. Then when you recall that thing, you also recall a sense of the other stuff that you squirrelled it away with. Thus when you recall Alice Cooper from memory, Ozzy Osborne sometimes emerges with him, along with a bat or a chicken maybe, but never a block of stilton.