Data Science – Alex Loveless

February 12, 2024

Romancing the AI – Using Neural Networks as part of the creative process

I’m not just an artistic genius you know. I have a day job (well, not at the moment, but let’s pretend) that involves me using clever coding algorithms to tease information out of voluminous and/or complex data sets. It’s both challenging and satisfying and I’m very good at it. It’s also not so different from painting as you might think. Now, before I get into the greasy nuts-and-bolts of this, let me briefly address the DALL-Ephant in the room. Some of my art is created with the help of my pet AI. I don’t want to get into the supposed existential threat that generative AI poses to the artistic community, but I recognise that merely the mention of that TLA (two-letter acronym, and yes I went there, sue me) makes some folks a bit queasy. So let me make something clear – the AIs that I create, either as part of my day job, or to aid my creative process, are neither a threat to anyone’s livelihood, artistic or otherwise, nor the ongoing existence of humanity. I recognise the rather distressing threat that the new breed breed of AI poses to our societal cohesion, and I intend to be part of the solution rather than the problem, but neither my pet Neural Network, nor ChatGPT are sentient, or ever will be, and in my case the use of AI in no way cheapens, short-cuts, or replaces any part of my creative process, as I will discuss here at length. Nor does it steal anyone else’s creative output outside of the referencing of photos and such that is already plainly apparent in much of the rest of my artwork.

Still not convinced? Allow me to walk you through the process, so you can see for yourself. I might get a little bit technical here and there, and feel free to skim any bits that look a bit scary, none of these are essential to the understanding of the overall process. For those that can grok the nuts-and-bolts, please be nice, this was all done in my free time and doesn’t quite pass the quality control I apply to my usual day-job output!

So why do I use an AI in my creative process? Well, I vaguely remember a fit of unconstrained, impulsive ADHD enthusiasm on a long train journey a couple of years ago. I think my thought process was along the lines of “I’ve got all these images that I’ve collected from the web that I use as painting references, I wonder what would happen if applied my python skills to mess with them in interesting ways” and I set about coding my first “image mangler” using my phone right there (I was likely using Pythonista on my iPhone at that point, but were I doing this right now with my Android phone I’d use Pydroid). At that point I my idea was to algorithmically mess with pixel values to mutate the images in interesting ways. This meant dreaming up interesting algorithmic approaches that weren’t those used by the myriad image filters available as standard in all the existing image manipulation tools (e.g. the Gaussian filter).

When you’re manipulating images in any tool, such as Photoshop or GIMP or whatever, what’s actually happening in the background is that your image is converted to a 3-dimensional numerical matrix (aka an array, or sometimes vector) which represents the pixel values of that image – specifically, 3 two-dimensional arrays, one for each of the primary colours of the colour model you’re using, usually RGB – Red, Green and Blue (I could get into a long diatribe about the various colour models and how there’s really no such thing as “primary” colours, but that’s for another day). Each value in the 3 two-dimensional arrays represents a pixel intensity of the pixels that make up that image (0 meaning no colour, 255 meaning full colour). When the images are rendered the three values for each pixel are combined to produce a specific hue (in the subtractive colour model, again, for another day). If you change any of the values of any pixel, you can subtly change the resulting image once the channels are recombined. Since the digital version of this image is a simple mathematical matrix, the manipulations are usually done mathematically. So by adding or subtracting numbers from the pixel values, you can change the colour or intensity of that pixel, and those around it. Clever, eh? Most of the common filters use some pretty hefty and complicated maths, a lot of which I don’t care to spend the time trying to understand. I’m a much baser beast than is capable of such mathematical refinery, and my method is almost always “brute f***ing force” aka, arbitrarily changing shit with pretty simple numerical functions (add, subtract, divide etc.) and seeing what happens. Believe it or not, this is less clever than it sounds. And to make it even less sophisticated I reduced each image to a single channel (in effect, making it black & white, or more accuratly, monotone, since they could be rendered in any single colour) and crop them to a specific size and shape.

Now, paradoxically, it was that last step where all the real magic and fun kicked off. Yes, I produced some interesting effects by manipulating pixel values based on the values of adjacent pixels, but that was rarely the most interesting aspect of the resulting image, it was their composition.

Firstly, let me address the question of why I cropped them at all. Simply, when working with multiple mathematical vectors (one for each image), life gets a lot easier if they’re the same size. I was basically being lazy, since if I had deal with a different size and aspect ratio with every new image, I would need to write some boring code to deal with that. Bollocks to that, thought I, I’ll just make them all square.

So far so dumb. But I still had decision which square to retain from each image (presuming the image was not already square), and since I intended to run algorithm against hundreds of images, there was no way that I was going to manually choose the ideal crop for each one. So, applying my ongoing, borderline pathological, policy of brevity, I googled some code (this was back in the hazy mists of time before ChatGPT saved the world) that simply cropped the largest possible square from the centre of the image. Behold:

def crop_center(self, img, crop_width, crop_height):  
    img_width, img_height = img.size  
    return img.crop((  
        (img_width - crop_width) // 2,  
        (img_height - crop_height) // 2,  
        (img_width + crop_width) // 2,  
        (img_height + crop_height) // 2)  
    )

Some serious shit going on there, right? If you don’t understand it, don’t worry, it’s not important. Just savour the glorious majesty of the resulting image:

Original Image - a woman with crossed arm — Original Image

Cropped Image - crossed arm woman with half her head missing — Original Image

I think you’ll agree with me, that that’s, at best, a pretty insensitive crop. Which is less than ideal if what you needed was for the integrity of the original image to be largely retained through the process. But I didn’t, and what I saw in this image, and many like it, was a thing of wondrous beauty. A composition so divergent it borders on blasphemy. Ejection from traditional art establishment in a 500 x 500 matrix. A compositions that I would never have thought to use, or have the balls to select, in a millions years. It was love at first sight!

So what was the actual revelation here?

Bizarre compositions can be very pleasing
Computers are really great at messing things up

(I’d like to point out that this latter revelation pre-dates ChatGPT and its much lauded capacity for beguiling nonsense.)

These revelations got me thinking: what other perverse ways can I coerce a computer to mess up images?

Ironically, given how easy it was to elicit this specific behaviour, figuring out new ingenious ways for serendipitous image corruption was oddly hard. Where to even start? My algorithmic approaches to date were interesting, but hardly mind-blowing. I started hunting around in the wider field of digital generative art, and found some fantastic stuff, but also some scary maths that I didn’t much feel like grappling with (although I will no doubt revisit this area at some point) and lots of largely tedious AI based approaches. However, my skulking in these dark crevices of digital creativity did resurface some arcane knowledge derived from my day job as a data scientist and analyst from a good 6 or 7 years previous when I was messing around with language models. Specifically, the working of a Neural Network type algorithm crafted at Google that was all the rage at that point called Word2Vec. Word2Vec is a direct ancestor to the modern Large Language Models such as ChatGPT. It’s what’s generically known as an autoencoder. I won’t get into the nuts and bolts of autoencoders as language models here, suffice to say that their job is to take text in at one end, and then reproduce it at some later time, as best it can. In their most basic form, then are overly complex and poorly performing compression algorithms, but in their more refined form are, well, extremely useful complex and poorly performing compression algorithms. But you’ve seen the conjuring tricks they perform via your surreptitious use of ChatGPT and no doubt use them already to churn out the sort of boring prose that you secretly used to enjoy churning out yourself. (This post was produced entirely by me and Neovim by the way, so pipe down already!).

Anyway, it took a single google search to discover that same principle can be, and has been, applied to images, and it got me pondering: I wonder what sort of weirdness an autoencoder would produce if I gave it lots and lots of images to memorise and not enough “memory” to remember them. Surely it would start to mix all the images up in fun and delightful ways?

So I immediately set about cutting and pasting bits of python code from a tutorial I found with the documentation of foundational Neural Network library Keras and a few hours later I had my results, and whoa where they results! The rest, as they say, is history. I set up a Instagram account and started to share all my delightfully warped and frequently spooky images.

For those desiring a little more specificity for what I actually did, I will elaborate a little here, but feel free to skip this bit if technical tomfoolery doesn’t float your boat.

Before I get into the tasty neural network action, a quick note on the wider technological landscape. All of the fun and wonder was created with the combination of one or more of the following elements:

Python
Numpy
Keras with a Tensorflow backend fot the neural network
The PIL image processing library

Like any Deep Learning style neural network, an autoencoder uses one or more layers of interlinked hidden layers. For my proposes, how these interlink is not particularly important, every node in each layer links to every node of the next. Where a more standard deep learning network, for example some sort of classifier, would terminate in a very narrow output layer (e.g. 1 neuron for a binary classifier), the autoencoder output layer has the same number of neurons as there is pixels in the input, namely the number of pixels in your input images. The output of that layer is scaled back up and reconstituted into an image, which is what I share. All my images are of uniform size, 500px² initially, but higher resolution now, which, for the sake of brevity, I flatten to a 1-dimensional 250,000 wide vector, scaled from the standard 0-255 range to a unit vector for the usual reasons. The autoencoder is tasked with “minimising” the error (loss) function (I tried various, but MSE gives me the most satisfying results) of the output when compared to the original input vector. You’ll note that in the prior parentheses that I used the word “satisfying”. This was not an accident. The task here is not to produce the best, or most accurate, result, it is to produce the most artistically satisfying result, which means I specifically do not want to absolutely minimise the error function or even close, since that would result in the original images being reproduced near perfectly, which is no use at all (at least for my use-case). So instead I want to reduce the error function to the degree to which I get something that isn’t just random noise, but also isn’t a faithful representation of the image. How do I do this? I use some or all of various approaches, the point being to “constrain” the network to prevent it from doing its job properly, for example I can:

Stop the training process prematurely when it has reduced the error function to some specific value or percentage
Train it on far too many images for the size of the network to remember, which is almost the same as…
Giving the network too few layers or neurons or connections to encode all the image information
Making the central layers ridiculously small
Use wildly divergent images (e.g. a mix up landscapes with portraits)
Dropout layers and similar

All this while trying different hyperparameters, layer configurations and error and activation functions (ReLU tends to work the best) and combinations thereof. There is some method and theory behind my experimentations but, since there is no specific, desired output, just noodling around with stuff is just as effective.

The worst of the technical detail ends here, but I can’t promise that some won’t leak out during the rest of this overly long post, soz.

By definition, the model created by any combination of configuration is also a product of the images that are input. I deliberately don’t input all images, since I have thousands and it would take too long to train on my Mac, but it’s also not desirable to do so. The crazy randomness is in part a function of the images that are chosen and the selection of images is part of the parametrisation and vital to the creative process. By controlling how many images input, even when chosen at random, I can effect the final result – just as I control my palette and brush strokes when I paint. I have also given myself the ability to include specific images to tailor the effect of the final output. Via this method I can create collections of similar or thematically linked images.

There is a separate module whose responsibility is to grab some images and prepare them for the training, and which offers further opportunities to affect the outcome of the model by, for example, tweaking the contrast of the input images. It also allows me to experiment with different resolutions (I’ve managed to train effective models of up to 1500px²). In theory, the higher resolution the better, but the exponential growth of the input vector places some practical limits given my hardware, and varying the resolutions produces different effects, so constraining this is sometimes desirable. There’s also the issue that many of the input images are of lower resolution, which is not a problem for the model and almost certainly is responsible for some of the interesting effects created.

And as for those interesting effects? See below, they’re fascinating, aren’t they? The model mixes up and mutates the source images blending and bleeding them into one another in bizarre and unpredictable ways. Some of the source images contain text which leaks through in enticing and haunting ways. Faces peep out from the trees from which they are growing, or blend into and merge with other faces. Planets mingle with beasts and butterflies appear to emit human language. There’s a haunted unity that creates cohesion among the hideous divergence. When staring at grids of these images I feel like the demoniac substrata of the universe is malignantly whispering to me like space-warped tarot cards. This is the esoteric language of Lovecraft’s Old Gods writ in digital form (more on this later).

A grid of bizarre, square, monochrome images

Because there is no right answer, and because I want unique images, I don’t keep the individual models created. They’re also pretty big and I don’t want to waste the disk space on them. So in effect, every image is unique, although certain input images seem to resonate more highly with the training algorithm, and so tend to turn up more prominently every time they are included in the input. I do, however, record the input parameters of the models that produced the most interesting and impressive images and reuse these. There is a fairly narrow range of parameters and inputs that yield good results, within which I still have a inexhaustible range of combinations to play with (as well as an ever growing repository of digital imagery). Many combinations and ranges of parameters merely yield white-noise or conversely overly accurate recreations, and can be discarded.

So the model gets trained from scratch every time it runs, like starting with blank canvas (metaphor both intended and unintended). Every time is a surprise and a wonder. The number of output images is exactly the same as that input, since the job of the machine is to reproduce the input images, or at least attempt to. The best models need at least 100 input images, so there are that many outputs to sort through and select the best from. Even the most effective models produce a large number of duds, by which I mean images that are completely unusable, usually because they are white-noise, or close to it, or too faithful a recreation. These can be discarded. Among the rest, many are not particularly interesting, but at an estimate, one in ten is a gem! This ratio varies greatly across the batches. Even with some home-grown tools to aid the filtering and selection process, it’s pretty laborious, but strangely dopamine filled, so it pleases and soothes my ADHD brain. I rarely adjust or doctor the images output – they are presented as is.

I could ramble on about this for a lot longer – it’s a fascinating and weirdly compulsive process, but I’ve already gone on too long. Maybe I’ll elaborate on a few of the areas that that I’ve glossed over at some point.

So what do I do with all these images, and how does that relate to my main body of work? Well, for the latter, initially at least, nothing at all. I set up a dedicated Instagram account, an obvious choice given how, when viewed via your profile page, the images are neatly arranged in grids of squares. Over time, the whole thing evolved to a convoluted, H.P. Lovecraft inspired hauntological framework called The University of Wilmarth Faculty of Eschatology. This is an ongoing labour of love, and houses thousands of the raw outputs of the models. Where it’s going, I’m not sure, but it’ll be fun finding out!

Along the way I used the images for a more abstract inspiration to my painted works. Mostly this was by way of more adventurous compositions, and via the introduction of more random and organic elements to the physical process. The images themselves were too intricate and disordered to scale up as direct references, and my original intention was to keep the two realms of my work separate. But I loved the images so much I had a nagging desire to see them on hanging next to my other art. Also, and inspired by the obsessively square digital art of my good friend and one of my favourite contemporary artists Mark Burden, I had an urge to see my works in uniform grids in the physical world.

So I set about figuring out how to do this. I could have just had them printed on nice paper and had them framed, but that felt a little lazy, especially since the creative process is already a little bit “factory”. It’s still possible that I’ll do this, but the more natural approach, given my tendency to incorporate elements of collage into my work, was to transfer the images to canvas where I could embellish and accentuate them. My smaller pieces are simply printouts from my fairly pedestrian home inkjet printer which are affixed to canvases using acrylic medium. I then go at these with acrylic inks, spray paints and the like. For the bigger one I scale up the images across multiple sheets and after which I follow a similar process. Others are simply scaled up by hand, and result in more “traditional” paintings. Regardless of which approach I use, I find the process highly enjoyable and creatively stimulating. It’s all win.

There’s a fair amount of extra bits that happen along the way, but that’s the gist of it. Having written this up, it feels like a ridiculously long, complex and convoluted process. I was worried that others might have thought me lazy for using computers and AI as part of my creative process! And I do consider this a deeply creative endeavour. At every step of the process I must make choices that affect the outcome, and the images that I choose to collect (as inputs) and share (as outputs) is based on decisions I make that are deeply peculiar to me. Were I to hand over all the code to anyone else (and I’m not averse to tidying it all up and open sourcing it at some point), I feel absolutely certain that the outputs they create would be dramatically different from mine, just as if someone else co-opted my studio and used all my materials and used the same reference images, they would yield dramatically different results. So concerned was I for a while that this process would be considered “cheating” or somehow creatively moribund, I considered not disclosing it at all. But I’m genuinely proud to have devised a such an innovative and unique approach. Maybe the works aren’t to everyone’s taste, but the same goes for the rest of my work, and anyone else’s for that matter.

For the most part I have paused my use of this approach, purely because other creative shiny things distracted me. I fully intend to resume this creative avenue and perhaps evolve it further in the future. One potential method for expanding it is to reintroduce the colour. I suspect that the warping of the hues could yield some really interesting results, or maybe just a brown mess! On the other hand, I’m really quite attached to the sea of monochrome. We’ll see I guess. There are also a plethora of other approaches to image vectorisation and model creation (e.g. convolutions) to play around with. Not to mention the essentially infinite choices of how to use the output images. In the meantime, my pet AI rests placidly in it’s cage, I’ll give it some attention when it starts gnawing at the bars.

March 7, 2016

Chinese Whispers Graph Clustering in Python

I needed a simple and efficient unsupervised graph clustering algorithm. MCL is a bit heavy for my needs and I was after something that was available in pure Python (because of environment access and compatibility issues) pretty much immediately. There isn’t exactly a lot of choice! I stumbled across Chinese Whispers an elegant and simple solution. I couldn’t find a simple implementation in Python so I created one myself using the formulas on the original paper. It uses NetworkX (for convenience – you could easily implement without this) and is incredibly fast.

import networkx

# build nodes and edge lists
nodes = [
    (1,{'attr1':1}),
    (2,{'attr1':1})
    ...
]
edges = [
    (1,2,{'weight': 0.732})
    ....
]

# initialize the graph
G = nx.Graph()

# Add nodes
G.add_nodes_from(nodes)
# CW needs an arbitrary, unique class for each node before initialisation
# Here I use the ID of the node since I know it's unique
# You could use a random number or a counter or anything really
for n, v in enumerate(nodes):
    G.node[n]['class'] = v

# add edges
G.add_edges_from(edges)

# run Chinese Whispers
# I default to 10 iterations. This number is usually low.
# After a certain number (individual to the data set) no further clustering occurs
iterations = 10
for z in range(0,iterations):
    gn = G.nodes()
    # I randomize the nodes to give me an arbitrary start point
    shuffle(gn)
    for node in gn:
        neighs = G[node]
        classes = {}
        # do an inventory of the given nodes neighbours and edge weights
        for ne in neighs:
            if isinstance(ne, int) :
                if G.node[ne]['class'] in classes:
                    classes[G.node[ne]['class']] += G[node][ne]['weight']
                else:
                    classes[G.node[ne]['class']] = G[node][ne]['weight']
        # find the class with the highest edge weight sum
        max = 0
        maxclass = 0
        for c in classes:
            if classes[c] > max:
                max = classes[c]
                maxclass = c
        # set the class of target node to the winning local class
        G.node[node]['class'] = maxclass

Given its simplicity it’s a remarkably effective algorithm. The image below shows a Gephi visualisation using the ForceAtlas2 algorithm. The node colours show the clusters identified by CW.

As you can see, the two algorithms broadly agree. CW took seconds to run whereas Gephi taxed my CPU to the max for many minutes (actually I think it was 10’s of minutes).

If you can fit you data into a graphical form, this is a very viable alternative to K-means style clustering, made even more attractive (for certain tasks) by the fact that it’s parameter free and thus you don’t need to pre-define the number of clusters (the bane of many a data scientist). It just finds the clusters that are there. This is, of course, also a draw-back in some circumstances. For example, if your data is heavily interlinked (a high degree to cardinality ratio) with no natural subgraphs, CW may just find a single cluster where you can demand K-means go find some. You can get around this to extent by relaxing your edge weight threshold (i.e. induce a subgraph with only edge weights greater than a threshold, then cluster that) an approach prone to graph fragmentation, which may or may not be desirable. It’s also prone to finding micro-clusters which in many cases could be construed as noise.

For my purposes it works incredibly well and I assume scales well. So let’s all turn to the left and tell the next person all about it.

Data Science

March 3, 2016

Does Word2Vec Dream of Semantic Sheep?

I played with Google’s magical word2vec neural network some time ago. I found it interesting but I had no immediate use for it, so I filed it in the ‘must remember to check this out further’ section of my disorganised brain. More recently I found myself wrestling with topic modelling resulting in a near terminal headlock. I have a data set of very short (say, 2 – 50 word) documents that I want to group thematically. LDA and it’s various cousins were struggling with this task. There are few of reasons for this. Firstly, even though LDA, LSI etc. perform a sort of vector dimensionality reduction, this only works to a point. A three word sentence contains barely enough information to derive a single topic from, let alone a distribution, and consequently my topic distributions were too sparse to do much with. Secondly, and exacerbating the first point, the topic space across the corpus is pretty limited and somewhat homogeneous. Thirdly, and exacerbating the other two points, there is marked lack of variety in the language used across the corpus, and from one topic to the next. Humans were struggling to distinguish one theme from another, so what chance did a computer have? I was just about to give it up as a lost cause when I remembered that word2vec has some similarity measures, and a vague recollection of someone suggesting it could be used for topic modelling. My basic theory here is that if I can compare sentences for similarity I should be able to group them via that similarity (I’m using a graph clustering model to do this). So as a last ditch effort I ran word2vec over my corpus and started playing around to see if it could make sense of my data. The results were phenomenal! The similarity graph created from a simple, untuned word2vec model outperformed the other models at unsupervised classification 10 fold at least – where before I saw only loose semantic groupings with many mis-grouped items, I now saw empirically cohesive and accurate groups. As pleased as I was by this turn of events, I didn’t understand why Google’s simple neural network worked at all for my purposes, let alone outperformed everything else. So I bathed myself in warm, welcoming, buoyant sea of word2vec’s vector space. As I did so, I started to appreciate word2vec’s spooky action.

It’s not my intention to repeat what’s already been said about word2vec but merely to state my own findings. I’ll start with the crux of my main confusion about the model. I’m using to to interpret and group customer feedback. My understanding of word2vec is that it groups words that are semantically similar, or at least proximal. You can explore the most similar words to any given word or words easily in word2vec. My model was trained on a million or so items of customer feedback from a website survey. I should be swift here to mention that they have an excellent site that works well for the vast majority of customers, but like every website, doesn’t preform well for everyone all the time. So two words that occur in close proximity a fair amount are “site” and “slow” (it’s also worth pointing out that “site” actually co-exists with “fast” more often in the same corpus, however, we’re looking for problems to solve, not praise to lap up). However, when I looked at the top 50 words in closest proximity with “site”, “slow” was nowhere to be found. I get loads of similes of the the word “site” (e.g. website). And and all the words most closely related to “slow” are other words that loosely mean slow (e.g. sluggish). It was obvious to me at the point what word2vec was actually doing in this instance, and that my expectation was not aligned with how it works, but this lead me to a bit of an epiphany – holy shit, word2vec understands actual semantic relationships between words without any formal teaching; purely by inference! To put it another way, it groups together words it rarely if ever sees together, even in the same document. That’s pretty clever for a simple neural network with only a single hidden layer. I also picked out similarities in misspelled words. This is deceptively helpful since one of my core frustrations with the data sets (at least for the purposes of supervised learning and approach I’m also using) was that people don’t tend to put much effort into spell checking what they enter into an online survey. That’s some pretty spooky action!

So, on the surface at least, it seemed that the reason my similarity scores were so on the mark was that word2vec was cleverly able to pair of similar words in a sentence and thus was able to create robust similarity scores. However, this explanation is a bit Newton to Einstein – a good explanation, but not the whole story. Word2vec’s spooky action is a lot more abstract and, dare I say it, mysterious. This deserves a little more probing. The model that word2vec actually produces is actually the single hidden neural network layer previously referenced. It consists of a n x m vector space where n is the number of words in your corpus (optionally pruned to get rid of infrequent or too frequent cruft) and m is the an arbitrary number of floating point dimensions usually some where between 100 and 700. These dimensions are somewhat intractable from a human perspective. They are quanta of a continuous abstract vector space that maps a territory of words in a sort of semantic relief map. Words of similar meaning exist in the same general area of the map. Tribes of words exist in a single area just like tribes of people do in the real world. This extends past similes to words of the same type however. I ingested the prebuilt vector space on the word2vec home page (the 300 dimension Google news one) and did some exploring. I discovered various bits of spooky action:

Names of musicians (Alice Copper, Ozzy Osbourne, and David Bowie) coexist together with bands (Metallica, Motorhead) all of which bare no relation to, say “cheese”
Names of US presidents occupy the same general space, with small offsets that seem to suggest political affiliations (more research needed here), as do scientists with a little evidence that they cluster with their respective disciplines
Parts of the brain (and indeed neurotransmitters) all occupy the same space, and a cursory appraisal suggests that closer proximity exists form those parts closer to each other (hypothalamus, nucleus accumbens and midbrain all cluster very closely). One assumes that this is the same for all anatomical parts
It has no sense of opposites – fast and slow cluster very closely together, black and white even more so
Words cluster together when they are notionally similar rather than the same type of word, so “black” and “blackest” cluster close together. There seems no definable continuous space for, say, nouns, proper nouns, adjectives etc.

There’s a sense that it clusters words when they seem interchangeable to a greater lesser degree. The mathematical offset described with the famous “king – man + woman = queen” example seems to reinforce this. The spacial significance comes further to light when you consider the two main ways to interrogate the vector space. The standard way (al la Gensim and others) is to scan the entire vector space for vectors with the closest cosine similarity to the vector of the target word (which is usually something in the same proximity). When you’re dealing with multiple words (e.g. n-grams, sentences or even whole documents) the approach is simply to find the exact vector for each word, then take a column level average across those words to create a new vector which then goes through the same cosine similarity malarchy. When comparing one sentence/document to another, we take the same average and get the cosine between the two. The second approach looks to take an actual numerical offset from one words, or collection of words, to another as per Kusner et al. In combination these two approaches make the topological aspect of the model more salient still suggesting that the word embeddings exist in some intangible semantic space-time of numerous dimensions and geometry which, in a very real sense, is exactly what it is. A simile generator may be a convenient way to describe it, but the reality is much more complex and elegant. So back to my original idea that word2vec was pairing off words in a sentence. This is not the case at all. Using the cosine similarity approach as I was, what was actually happening is that I was generating vector point constructed from an average of a collection words, then go find other word or words that are proximally (spatially) close. Words are never compared, we actually just go find the tribe that has the most DNA in common with my sentence, as it were.

So how does it figure this stuff out? Well, some much better mathematicians than me already concluded that they “don’t really know”, so what hope have I got? I can just try and make some sense of what I observe. Much has been made of the difference between CBOW and Skip-gram as way to evaluate the text, however there seems little to suggests that either contributes to the overall spooky action, rather than build upon the the mysterious workings of a simple neural network. The information is in there, in all the written text, and nothing is inferred that isn’t visibly available – there’s no extrapolation going on here, no logic. Word2vec doesn’t read it or understand the text, it just picks up patterns. Interpretation as an adjunct to semantic awareness is a job for a future, much more sophisticated algorithm or model or AI. The best analogy I can think of for word2vec is the very mechanism that neural networks try and emulate – the human brain. In particular long term memory. It’s easy to imagine that, as information flows in through our senses, it is brokered into similar abstract representations in the cerebral cortex, then either reinforced (learned) or forgotten. We know that the best way to remember something is to relate it to something to something we already have a good sense of. Then when you recall that thing, you also recall a sense of the other stuff that you squirrelled it away with. Thus when you recall Alice Cooper from memory, Ozzy Osborne sometimes emerges with him, along with a bat or a chicken maybe, but never a block of stilton.

Data Science

July 10, 2015

17 seconds

I don’t know exactly how long it takes to read the 50 or so words attributed to me in a recent Guardian article, but I doubt that it equates to 5 minutes, probably more like 17 seconds, meaning that I still have the larger portion of my 5 minutes of fame to come. What wonders await is anyone’s guess, but in the meantime I will juxtapose those 17 seconds of written text with a note of clarification.

On enthusiastically posting a snippet of said article on Twitter (and while I sat back and basked in the adulation) an old friend and colleague and data guru @hankyjohn responded to one of my points with a contradiction. Specifically, I said:

But Loveless said he associated the idea of having a single customer view with “big, monolithic, old school, relational databases, which are horribly hard to manage and incredibly expensive”. Just collecting data on customers for its own sake is useless unless you can do something useful with it, he said: “You don’t need to understand everything about the customer, you don’t need to collect and structure everything about the customer, you just need to have a sense about them.” He said the new data management platforms do not promise a single customer view, just a general view of what that person likes and does.

To which @hankyjohn responded (quite correctly):

@alexmloveless good work. Can I disagree though? False dichotomy for me. Traditional data warehousing can coexist nicely with other stores.

There followed a brief exchange in which I heroically clarified my point. Rather than subject you to those stilted 140 character info-barks, I’ll summarise the crux of my points here.

Although I completely stand by the point illustrated in that article, it sits removed from a broader context that would have been apparent were you in the room at the time. The wider point is this: Since the days of when advertising was first invented (by the people on Mad Men) marketers and the like have endeavoured to understand their customers. Such understanding, for the vast majority of the intervening period, was derived from stuff we can learn from any detail we can collect about them (name, address, demos etc.) and performance data (what works and what doesn’t). The former data probably existed on bits of card in filing cabinets for a long time before eventually being diligently transmogrified into their digital equivalent when computers became a thing. These digital equivalents eventually required a structured form so that they could be easily accessed and queried for the purposes of selling us stuff that we don’t need. The medium for this structure was the humble database, of which for a long time there was really only one form worth talking about, the RDBMS – relational databases. Relational databases are marvellous. They impose structure on unruly data and make it easy to access, analyse and aggregate. Thus, modern marketing became used to using these things to store their customer data, which needed to be kept clean and tidy. This was how you knew who your customers were – you kept records of them in a big old RDBMS called “Customer DB” or “CRM Store” or something equally as enticing. Problem is, since there were many different sources of data, companies frequently ended up with multiple stores, often storing overlapping data sets. Quite rightly at some point marketers and IT people alike started saying things like “wouldn’t it be great if all this data was deduplicated and stored in once place” and thus was born the dreaded Single Customer View.

Roll on a decade and SCV projects that were started on the back of wishes from marketers are still incomplete and running up legacy costs of tens of millions. Meanwhile, while failing to deliver on the meagre requirements of the time, we now have all these bloody channels and social networks and mobile devices and internets-of-things and Bigness of Data. Asking IT to justbloodywell get me a dataset I can trust is trouble enough let alone incorporating twitter handles and cross device awareness. Yet marketers are still asking such things of an SCV thinking that this once-so-called magic data bullet is actually the right place for such things.

The belief is still widely held that customer data really only can live in a big ole monolithic relational data store. This comes from a lack of distinction perpetuated on both the marketers and IT people’s part. The distinction is between Master Data Management (MDM) and, well, all the other types of data. It’s a distinction between hard, indelible customer data for the purposes of hard, lofty uses, vs the sort of fuzzy profiling that proliferates across the web and haunts you with depressing display adverts for TV’s you had briefly considered buying before that whopping council tax bill came in.

Modern marketing data is not about coherent customer information, it’s about cookies and inferred data. When marketing to (or at) someone it’s more useful to know their gender than their name. A mobile geolocation is better than a postcode. A constantly evolving stream of inferred preference data is better than a mosaic classification. This is all achieved by a web of data collection technologies and services that use the humble cookie as their primary currency and couldn’t give a hoot what your name is. You could try and mash all this lovely data into your SCV but you’d end up changing your schema every two weeks and probably hit performance/scale issues pretty quickly. Plus it’ll take 6 years and countless more million quid when you could have invested in one of those mystical unicorn DMP thingies. In such a circumstance your beloved SCV data would mostly be flowing in the other direction and consequently making an anonymised cookie store the most complete view of your customer data. God forbid!

Now don’t go rushing off to Adobe or Oracle while instructing your IT team to delete that pesky SCV. You probably need it. Email comms would not be possible without it. And if you have a more tangible relationship with your customer (like, you sell stuff to them) you need a master record with accurate, non-volatile information about them that’s nicely structured, secure and private. This is Master Data Management, and only relates secondarily to marketing. And as the learned @hankjohn correctly points out, it sits happily and harmoniously in a mature data ecosystem with anarchic jonny-come-latelys like DMPs (and a bunch of other sinister data entities).

This was the thrust of my grumpy diatribe at the Guardian offices, which perhaps doesn’t come through too well in the article. I wasn’t misquoted as such, just underquoted. The moral of this story? Write more about me.

Data Science