Does Word2Vec Dream of Semantic Sheep?

I played with Google’s magical word2vec neural network some time ago. I found it interesting, but I had no immediate use for it, so I filed it in the ‘must remember to check this out further’ section of my disorganised brain. More recently I found myself wrestling with topic modelling, resulting in a near terminal headlock. I have a data set of very short (say, 2–50 word) documents that I want to group thematically. LDA and its various cousins were struggling with this task, for a few reasons. Firstly, even though LDA, LSI etc. perform a sort of vector dimensionality reduction, this only works up to a point. A three word sentence contains barely enough information to derive a single topic from, let alone a distribution, and consequently my topic distributions were too sparse to do much with. Secondly, and exacerbating the first point, the topic space across the corpus is pretty limited and somewhat homogeneous. Thirdly, and exacerbating the other two points, there is a marked lack of variety in the language used across the corpus, and from one topic to the next. Humans were struggling to distinguish one theme from another, so what chance did a computer have?

I was just about to give it up as a lost cause when I remembered that word2vec has some similarity measures, and had a vague recollection of someone suggesting it could be used for topic modelling. My basic theory here is that if I can compare sentences for similarity, I should be able to group them via that similarity (I’m using a graph clustering model to do this). So as a last ditch effort I ran word2vec over my corpus and started playing around to see if it could make sense of my data. The results were phenomenal! The similarity graph created from a simple, untuned word2vec model outperformed the other models at unsupervised classification tenfold at least – where before I saw only loose semantic groupings with many mis-grouped items, I now saw empirically cohesive and accurate groups. As pleased as I was by this turn of events, I didn’t understand why Google’s simple neural network worked at all for my purposes, let alone why it outperformed everything else. So I bathed myself in the warm, welcoming, buoyant sea of word2vec’s vector space. As I did so, I started to appreciate word2vec’s spooky action.
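
For the curious, here’s a minimal sketch of the kind of pipeline described above, using gensim and networkx; the tiny corpus, the similarity threshold and the use of connected components are illustrative assumptions rather than the exact setup I used on the real data.

```python
# Group short documents by averaging word2vec vectors and clustering a
# similarity graph. Assumes gensim 4.x, networkx and numpy are installed.
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

docs = [
    "the site was really slow today",
    "pages take ages to load",
    "great website, very easy to use",
    "could not find the delivery options",
]
tokenised = [d.lower().split() for d in docs]

# Train a small, untuned word2vec model over the corpus.
model = Word2Vec(tokenised, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

def doc_vector(tokens):
    """Average the word vectors of a document (ignoring unknown words)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Build a similarity graph: documents are nodes, and an edge is added wherever
# the cosine similarity of their averaged vectors clears a hand-picked threshold.
vectors = [doc_vector(t) for t in tokenised]
G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if cosine(vectors[i], vectors[j]) > 0.5:
            G.add_edge(i, j)

# Each connected component is treated as a thematic group.
print(list(nx.connected_components(G)))
```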

It’s not my intention to repeat what’s already been said about word2vec, but merely to state my own findings. I’ll start with the crux of my main confusion about the model. I’m using it to interpret and group customer feedback. My understanding of word2vec is that it groups words that are semantically similar, or at least proximal. You can easily explore the most similar words to any given word or words in word2vec. My model was trained on a million or so items of customer feedback from a website survey. I should be swift here to mention that they have an excellent site that works well for the vast majority of customers but, like every website, doesn’t perform well for everyone all the time. So two words that occur in close proximity a fair amount are “site” and “slow” (it’s also worth pointing out that “site” actually co-exists with “fast” more often in the same corpus; however, we’re looking for problems to solve, not praise to lap up). However, when I looked at the top 50 words in closest proximity to “site”, “slow” was nowhere to be found. I got loads of synonyms of the word “site” (e.g. website), and all the words most closely related to “slow” were other words that loosely mean slow (e.g. sluggish). It was obvious to me at that point what word2vec was actually doing in this instance, and that my expectation was not aligned with how it works, but this led me to a bit of an epiphany – holy shit, word2vec understands actual semantic relationships between words without any formal teaching, purely by inference! To put it another way, it groups together words it rarely if ever sees together, even in the same document. That’s pretty clever for a simple neural network with only a single hidden layer. I also picked out similarities in misspelled words. This is deceptively helpful, since one of my core frustrations with the data set (at least for the purposes of a supervised learning approach I’m also using) is that people don’t tend to put much effort into spell checking what they enter into an online survey. That’s some pretty spooky action!
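
As a flavour of what those queries look like, here’s a hedged sketch using gensim’s most_similar; the model filename is a hypothetical placeholder, and the comments describe the behaviour discussed above rather than literal output.

```python
# Querying a trained word2vec model for nearest neighbours.
# "feedback_word2vec.model" is a hypothetical placeholder for a model trained
# on the survey corpus, not a real artefact from this post.
from gensim.models import Word2Vec

model = Word2Vec.load("feedback_word2vec.model")

# Neighbours of "site" come back as near-synonyms (website, webpage, etc.),
# not as words that merely co-occur with it, such as "slow".
for word, score in model.wv.most_similar("site", topn=50):
    print(word, round(score, 3))

# Likewise, the neighbours of "slow" are other ways of saying slow
# (sluggish, laggy), often including common misspellings.
for word, score in model.wv.most_similar("slow", topn=10):
    print(word, round(score, 3))
```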

So, on the surface at least, it seemed that the reason my similarity scores were so on the mark was that word2vec was cleverly able to pair off similar words in a sentence and thus create robust similarity scores. However, this explanation is a bit Newton to Einstein – a good explanation, but not the whole story. Word2vec’s spooky action is a lot more abstract and, dare I say it, mysterious. This deserves a little more probing. The model that word2vec produces is actually the single hidden neural network layer previously referenced. It consists of an n x m vector space where n is the number of words in your corpus (optionally pruned to get rid of infrequent or too-frequent cruft) and m is an arbitrary number of floating point dimensions, usually somewhere between 100 and 700. These dimensions are somewhat intractable from a human perspective. They are quanta of a continuous abstract vector space that maps a territory of words in a sort of semantic relief map. Words of similar meaning exist in the same general area of the map. Tribes of words exist in a single area just like tribes of people do in the real world. This extends past synonyms to words of the same type, however. I ingested the prebuilt vector space on the word2vec home page (the 300 dimension Google News one) and did some exploring (there’s a code sketch after the list below). I discovered various bits of spooky action:

  • Names of musicians (Alice Cooper, Ozzy Osbourne, and David Bowie) coexist with bands (Metallica, Motorhead), all of which bear no relation to, say, “cheese”
  • Names of US presidents occupy the same general space, with small offsets that seem to suggest political affiliations (more research needed here), as do scientists with a little evidence that they cluster with their respective disciplines
  • Parts of the brain (and indeed neurotransmitters) all occupy the same space, and a cursory appraisal suggests that closer proximity exists for those parts that sit closer to each other (hypothalamus, nucleus accumbens and midbrain all cluster very closely). One assumes that this is the same for all anatomical parts
  • It has no sense of opposites – fast and slow cluster very closely together, black and white even more so
  • Words cluster together when they are notionally similar rather than the same type of word, so “black” and “blackest” cluster close together. There seems to be no definable continuous space for, say, nouns, proper nouns, adjectives etc.
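
Here’s a minimal sketch of that sort of exploration, assuming you’ve downloaded the prebuilt 300 dimension Google News vectors; note that multi-word names use underscores in that model, and the exact neighbours and scores you get back may differ from the observations above.

```python
# Exploring the prebuilt Google News vectors with gensim (assumed installed).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Musicians and bands occupy the same neighbourhood, none of it near "cheese".
print(vectors.most_similar("Ozzy_Osbourne", topn=10))
print(vectors.similarity("Ozzy_Osbourne", "cheese"))

# No sense of opposites: antonyms come out as close neighbours.
print(vectors.similarity("fast", "slow"))
print(vectors.similarity("black", "white"))

# Parts of the brain cluster together too.
print(vectors.most_similar("hypothalamus", topn=10))
```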

There’s a sense that it clusters words when they seem interchangeable to a greater or lesser degree. The mathematical offset described with the famous “king – man + woman = queen” example seems to reinforce this. The spatial significance comes further to light when you consider the two main ways to interrogate the vector space. The standard way (à la Gensim and others) is to scan the entire vector space for vectors with the closest cosine similarity to the vector of the target word (which is usually something in the same proximity). When you’re dealing with multiple words (e.g. n-grams, sentences or even whole documents) the approach is simply to find the exact vector for each word, then take a column-level average across those words to create a new vector, which then goes through the same cosine similarity malarkey. When comparing one sentence/document to another, we take the same average for each and get the cosine between the two. The second approach takes an actual numerical offset from one word, or collection of words, to another, as per Kusner et al.’s Word Mover’s Distance. In combination, these two approaches make the topological aspect of the model more salient still, suggesting that the word embeddings exist in some intangible semantic space-time of numerous dimensions and geometry – which, in a very real sense, is exactly what it is. A synonym generator may be a convenient way to describe it, but the reality is much more complex and elegant. So back to my original idea that word2vec was pairing off words in a sentence. This is not the case at all. Using the cosine similarity approach as I was, what was actually happening is that I was generating a vector point constructed from an average of a collection of words, then going off to find other words that are proximally (spatially) close. Words are never compared directly; we just go and find the tribe that has the most DNA in common with my sentence, as it were.
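
A hedged sketch of those two approaches, again assuming the Google News vectors: gensim’s n_similarity averages each word list into a single vector and takes the cosine between the two averages, while wmdistance is Kusner et al.’s Word Mover’s Distance (it needs an optional dependency installed, and out-of-vocabulary words should be filtered out first). The two example documents are made up for illustration.

```python
# Comparing two short documents against the prebuilt Google News vectors.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

doc_a = ["website", "feels", "really", "sluggish"]
doc_b = ["pages", "load", "slowly"]

# Keep only words the model knows about, otherwise gensim raises a KeyError.
doc_a = [w for w in doc_a if w in vectors]
doc_b = [w for w in doc_b if w in vectors]

# Approach one: average each document's word vectors and take the cosine.
print(vectors.n_similarity(doc_a, doc_b))

# Approach two: Word Mover's Distance (lower means more similar).
print(vectors.wmdistance(doc_a, doc_b))
```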

So how does it figure this stuff out? Well, some much better mathematicians than me have already concluded that they “don’t really know”, so what hope have I got? I can just try and make some sense of what I observe. Much has been made of the difference between CBOW and Skip-gram as ways to process the text; however, there seems little to suggest that either contributes to the overall spooky action, rather than building upon the mysterious workings of a simple neural network. The information is in there, in all the written text, and nothing is inferred that isn’t visibly available – there’s no extrapolation going on here, no logic. Word2vec doesn’t read or understand the text, it just picks up patterns. Interpretation as an adjunct to semantic awareness is a job for a future, much more sophisticated algorithm or model or AI. The best analogy I can think of for word2vec is the very mechanism that neural networks try to emulate – the human brain, in particular long-term memory. It’s easy to imagine that, as information flows in through our senses, it is brokered into similar abstract representations in the cerebral cortex, then either reinforced (learned) or forgotten. We know that the best way to remember something is to relate it to something we already have a good sense of. Then when you recall that thing, you also recall a sense of the other stuff that you squirrelled it away with. Thus when you recall Alice Cooper from memory, Ozzy Osbourne sometimes emerges with him, along with a bat or a chicken maybe, but never a block of stilton.
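
For completeness, the CBOW/Skip-gram choice mentioned above is just a single flag in gensim’s Word2Vec; a minimal sketch with a made-up placeholder corpus:

```python
# Training the same corpus with CBOW (sg=0) and skip-gram (sg=1); everything
# else about the single-hidden-layer network stays the same.
from gensim.models import Word2Vec

corpus = [["site", "slow", "today"], ["site", "fast", "checkout"]]  # placeholder

cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)
```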

Internet Advertising Ethics – the gorilla in the room

Much has been said of late about the rise of the adblocker and what it means for the future of the advertising industry and, more worryingly, the internet.  I’ve largely kept my gob shut on the subject of advertising ethics up until now since it’s not very fashionable to stand up for the evil advertising community. But, after reading this call to arms for ethics academics, my resolve has been shattered. Bear with me caller, I shall explain.

Now let me start by saying I have no reason to besmirch Williams’s character, or single him out for my wrath – his article is well written and well argued and is clearly placed in the public domain for debating, and that’s exactly what I’m doing. Why respond to this article in particular? Simply, it belongs to the extreme edge of a movement whose point of view I have some (soon to be explained) objections to. Also, it arrived in my world at the point where my silence on the subject was already faltering.

Let me quote the final statement of the article as a flavour of the overall thrust of the debate:


Given all this, the question should not be whether ad blocking is ethical, but whether it is a moral obligation. The burden of proof falls squarely on advertising to justify its intrusions into users’ attentional spaces—not on users to justify exercising their freedom of attention.


Lofty ideals there. However, what Williams entirely ignores (as do the comments that I have read) is that most websites, and the adverts on them, exist to offer some sort of product or service that people actually want (either directly or by later fulfilment). What advertising does is draw people’s attention to those things, while also forming an essential part of the business model of the site displaying the ad. And you cannot blame any given company for desiring that any given consumer receives the aforementioned product/service from them rather than from someone else; after all, most businesses earnestly believe their product/service is superior, whether it is or not – otherwise, why bother? If people find themselves on media that consume their attention, or assailed by ads that distract them from doing the things that they supposedly desire (I think many people would admit to wanting to spend time playing Xbox games as much as, if not more than, spending time with family – hey, why not combine the two!) then maybe they want/need to be distracted.

Secondly, the cognitive bias/behavioural economics argument is a red herring. As the legendary Harvard Gorilla Experiment demonstrated, people are spectacularly good at missing bleeding obvious stuff when they are engaged in a task, and that’s assuming you can get them engaged in the first place – if they’re not interested in something, they simply won’t engage. If a person is so unengaged in the task of absorbing some web content that they get distracted by an advert, it suggests that the content isn’t much cop anyway and probably doesn’t deserve the attention it was getting. If a site places an ad so invasive that it makes it hard for the consumer to consume their content, then they can’t have much confidence in that content, and the consumer should certainly consider clearing off. But this even misses the key flaw in the argument – the biases alluded to evolved precisely so that we, as humans/mammals/animals, can focus our attention on what matters while also remaining alert to potential threats or, dare I say it, more interesting stuff. Saying that it is somehow unethical to appeal to these so-called “biases” (I prefer the term heuristics) is like saying that “blue cars should not be manufactured since people are drawn to blue and that would distract them from the car’s overall ‘carness’” or that “people shouldn’t dress nicely lest people fancy them” (that last one is exploited quite a lot in certain religions). We’re built to desire stuff (food, sex) – if we didn’t, we’d (literally) die as individuals and as a species. If people spend too much time on Facebook, or are distracted by ads, or get obsessed by Candy Crush and forget to collect their kid from school, it says more about the psychology and evolution of human nature than about the medium itself. We do what natural selection designed us to do. Facebook, Twitter, Daily Mail Online and the internet are symptoms of that, not the cause.

Further to this, these media of supposed attention corruption (the sites that house the adverts) are pretty damn good at keeping our attention. Williams states “A product or service does not magically redesign itself around your goals just because you block it from reaching its own”. But that’s precisely what they do. Facebook (for example) is AMAZING at holding attention, ads or otherwise. This is the case because they collect usage data about billions of people and their site optimises itself, in real time, around what people respond well to. Everyone has their own goal when using Facebook (frequently to spend “time” with absent family), and Facebook’s “product” is that goal. Facebook spends a hell of a lot of money on making their product as good as it can be, and they know that they are successful when people spend lots of time using it! That cost is paid for by your advertising eyeballs. So by negating Facebook’s revenue stream, Williams is degrading their ability to do the very thing he’s (paradoxically) getting antsy about them not doing (building a customer-centric experience). No doubt sites like Facebook could be better, but starving them of cash ain’t gonna help them in this endeavour!

So when the legions of ethics academics rise up and block the sorry arse out of internet advertising, which subsequently results in the news sites where they get their celebrity gossip going out of business, leaving them only with the Murdoch-funded, reactionary corporate propaganda-media (with all their ethics and stuff), they will have only themselves to blame. Perhaps then they will offer a better alternative to just “blocking” the problem out of view!

I feel a little like I’m defending the devil here, but if the free internet is to be maintained there is a balance to be struck. Advertisers need to work harder to build better online experiences, and consumers need to continue to put up with their attention being corralled a bit. It will be a rocky road to the equilibrium where both advertiser and consumer are happy, but the forces of reciprocal value exchange demand that that day must come.

Now, were Williams to make the broader argument about how those exploitations of attention lead to unhealthy lifestyles by tempting us with what we innately desire – e.g. fat and sugar and sex, and lots of all of it – which some people are largely powerless to resist, then I would fully support it as an ethical debate. If we’re here to debate the ethics of rampant non-consented data collection and abuse, I’m all ears. But the ethics of trying to get people’s attention? Get the gorilla out of here!

17 seconds

I don’t know exactly how long it takes to read the 50 or so words attributed to me in a recent Guardian article, but I doubt that it equates to 5 minutes, probably more like 17 seconds, meaning that I still have the larger portion of my 5 minutes of fame to come. What wonders await is anyone’s guess, but in the meantime I will juxtapose those 17 seconds of written text with a note of clarification.

On enthusiastically posting a snippet of said article on Twitter (and while I sat back and basked in the adulation), an old friend, colleague and data guru, @hankyjohn, responded to one of my points with a contradiction. Specifically, I said:

But Loveless said he associated the idea of having a single customer view with “big, monolithic, old school, relational databases, which are horribly hard to manage and incredibly expensive”. Just collecting data on customers for its own sake is useless unless you can do something useful with it, he said: “You don’t need to understand everything about the customer, you don’t need to collect and structure everything about the customer, you just need to have a sense about them.” He said the new data management platforms do not promise a single customer view, just a general view of what that person likes and does.

To which @hankyjohn responded (quite correctly):

@alexmloveless good work. Can I disagree though? False dichotomy for me. Traditional data warehousing can coexist nicely with other stores.

There followed a brief exchange in which I heroically clarified my point. Rather than subject you to those stilted 140 character info-barks, I’ll summarise the crux of my points here.

Although I completely stand by the point illustrated in that article, it sits removed from a broader context that would have been apparent were you in the room at the time. The wider point is this: since the days when advertising was first invented (by the people on Mad Men), marketers and the like have endeavoured to understand their customers. Such understanding, for the vast majority of the intervening period, was derived from whatever details we could collect about them (name, address, demographics etc.) and performance data (what works and what doesn’t). The former data probably existed on bits of card in filing cabinets for a long time before eventually being diligently transmogrified into its digital equivalent when computers became a thing. These digital equivalents eventually required a structured form so that they could be easily accessed and queried for the purposes of selling us stuff that we don’t need. The medium for this structure was the humble database, of which for a long time there was really only one form worth talking about: the RDBMS, or relational database. Relational databases are marvellous. They impose structure on unruly data and make it easy to access, analyse and aggregate. Thus, modern marketing became used to using these things to store its customer data, which needed to be kept clean and tidy. This was how you knew who your customers were – you kept records of them in a big old RDBMS called “Customer DB” or “CRM Store” or something equally enticing. Problem is, since there were many different sources of data, companies frequently ended up with multiple stores, often holding overlapping data sets. Quite rightly, at some point marketers and IT people alike started saying things like “wouldn’t it be great if all this data was deduplicated and stored in one place”, and thus was born the dreaded Single Customer View.

Roll on a decade, and SCV projects that were started on the back of wishes from marketers are still incomplete and running up legacy costs of tens of millions. Meanwhile, while failing to deliver on the meagre requirements of the time, we now have all these bloody channels and social networks and mobile devices and internets-of-things and Bigness of Data. Asking IT to justbloodywell get me a dataset I can trust is trouble enough, let alone incorporating Twitter handles and cross-device awareness. Yet marketers are still asking such things of an SCV, thinking that this once-so-called magic data bullet is actually the right place for them.

The belief is still widely held that customer data can really only live in a big ole monolithic relational data store. This comes from a lack of distinction, perpetuated by marketers and IT people alike, between Master Data Management (MDM) and, well, all the other types of data. It’s a distinction between hard, indelible customer data for hard, lofty uses, and the sort of fuzzy profiling that proliferates across the web and haunts you with depressing display adverts for TVs you had briefly considered buying before that whopping council tax bill came in.

Modern marketing data is not about coherent customer information; it’s about cookies and inferred data. When marketing to (or at) someone, it’s more useful to know their gender than their name. A mobile geolocation is better than a postcode. A constantly evolving stream of inferred preference data is better than a Mosaic classification. This is all achieved by a web of data collection technologies and services that use the humble cookie as their primary currency and couldn’t give a hoot what your name is. You could try and mash all this lovely data into your SCV, but you’d end up changing your schema every two weeks and probably hit performance/scale issues pretty quickly. Plus it would take six years and countless more millions of quid, when you could have invested in one of those mystical unicorn DMP thingies. In such a circumstance your beloved SCV data would mostly be flowing in the other direction, consequently making an anonymised cookie store the most complete view of your customer data. God forbid!

Now don’t go rushing off to Adobe or Oracle while instructing your IT team to delete that pesky SCV. You probably need it. Email comms would not be possible without it. And if you have a more tangible relationship with your customers (like, you sell stuff to them) you need a master record with accurate, non-volatile information about them that’s nicely structured, secure and private. This is Master Data Management, and it relates only secondarily to marketing. And as the learned @hankyjohn correctly points out, it sits happily and harmoniously in a mature data ecosystem with anarchic johnny-come-latelys like DMPs (and a bunch of other sinister data entities).

This was the thrust of my grumpy diatribe at the Guardian offices, which perhaps doesn’t come through too well in the article. I wasn’t misquoted as such, just underquoted. The moral of this story? Write more about me.