Word Embedding Training in Python

The backpropagation gradient calculation should be straightforward if you have already implemented an affine (fully connected) layer, a Convolutional Neural Network (CNN), or a Softmax classifier. One example showing the gradient calculation and backpropagation can be found here.
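
For reference, here is a minimal sketch (not the example linked above, and purely illustrative) of the forward and backward pass of an affine layer in numpy; the shapes in the comments are assumptions:

import numpy as np

def affine_forward(x, w, b):
    """Forward pass: x (N, D), w (D, M), b (M,) -> out (N, M)."""
    out = x.dot(w) + b
    cache = (x, w)
    return out, cache

def affine_backward(dout, cache):
    """Backward pass: propagate the upstream gradient dout (N, M)."""
    x, w = cache
    dx = dout.dot(w.T)           # gradient with respect to input, (N, D)
    dw = x.T.dot(dout)           # gradient with respect to weights, (D, M)
    db = np.sum(dout, axis=0)    # gradient with respect to bias, (M,)
    return dx, dw, db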

"You shall know a word by the company it keeps" - John Rupert Firth

Word Embedding has become very popular since Tomas Mikolov's word2vec paper was published. The idea is to compress words from a high-dimensional, sparse space (the vocabulary) into a lower-dimensional, continuous vector space. It is remarkable that certain linguistic and semantic relations are encoded in this space and can be manipulated with simple linear arithmetic. For example, vec('Rome') is close to vec('Paris') - vec('France') + vec('Italy').
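
As a quick sketch of how such an analogy query could be answered once embeddings are trained (here embeddings, dictionary and r_dictionary stand for the trained vectors and the word-index mappings built later in this notebook), one option is cosine similarity over normalized vectors:

def analogy(word_a, word_b, word_c, embeddings, dictionary, r_dictionary, top_k=3):
    # Answer "word_a is to word_b as word_c is to ?", e.g. France : Paris :: Italy : ?
    norm = embeddings / np.sqrt(np.sum(embeddings**2, axis=1, keepdims=True))
    query = norm[dictionary[word_b]] - norm[dictionary[word_a]] + norm[dictionary[word_c]]
    sims = norm.dot(query)                       # cosine similarity to every word
    exclude = set([dictionary[w] for w in (word_a, word_b, word_c)])
    best = [i for i in np.argsort(-sims) if i not in exclude]
    return [r_dictionary[i] for i in best[:top_k]]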

Continuous Bag of Words (CBOW) and Skip-gram are the two models commonly used to learn these vectors. Compared with other optimization problems, Word Embedding training has one particular difficulty: the vocabulary is too large to evaluate a full classifier over at every training step. To get around the expensive evaluation of the classifier cost (such as Softmax prediction + cross-entropy loss) over the complete vocabulary, negative sampling can be used to approximate the Softmax probabilities.
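
To make the saving concrete, here is a rough sketch of the per-example work; the sizes V, D and K below are illustrative assumptions:

V, D, K = 20000, 64, 10                # vocabulary size, embedding size, negative samples
h = np.random.randn(D)                 # hidden-layer vector for one training example
W = np.random.randn(V, D)              # output word vectors

# Full softmax: score every word in the vocabulary, O(V * D) per example
scores_full = W.dot(h)                 # shape (V,); softmax then normalizes all V scores

# Negative sampling: score 1 target plus K sampled words, O((K + 1) * D) per example
sample_idx = np.random.choice(V, K + 1, replace=False)
scores_ns = W[sample_idx].dot(h)       # shape (K + 1,); only these enter the loss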

This project documents some naive implementations of Word Embedding training. Many functions would need to be fine-tuned to build an efficient deep learning framework.

Part of the dataset-processing code is adapted from the TensorFlow word2vec tutorial.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import scipy.io
import scipy.misc
import time
import sys
import os
import glob
import random
from random import randrange
import zipfile
import collections
import cPickle as pickle

from sklearn.manifold import TSNE
from six.moves import urllib
from six.moves import xrange # pylint: disable=redefined-builtin


# Setup Matplotlib and autoreload
%matplotlib inline
%load_ext autoreload
%autoreload 2
In [2]:
# Because we are not using any existing framework/package, we need to define all the building blocks in Python ourselves
# 
def normalize_vecs(x):
    """ Row normalization function 
    
    Normalize each row of input matrix
    
    - Inputs:
        x: vectors (matrix)
    - Outputs:
        x: normalized vectors (matrix)
    """
    xl_vec =  np.sqrt(np.sum(x**2, axis=1, keepdims=True))
    x = x/xl_vec
    
    return x

# Sigmoid logistic function
def logistic(x):
    """
    Apply logistic function to input vector
    
    Inputs:
         - vector x with shape (N, D)
    Outputs:
         - out    # Output vector of shape (N, D)
         - dx     # Gradient with respect to x, shape (N, D)
    """
    out = 1/ (1 + np.exp(-x))
    dx = out * (1 - out)
    
    return out, dx

# Softmax function
def softmax(x):
    """
    Calculate softmax based probability for given input vector
    """
    x -= np.max(x, axis=x.ndim-1, keepdims=True)   # Numerically safe operation
    exp_x = np.exp(x)
    p = exp_x / np.sum(exp_x, axis=x.ndim-1, keepdims=True)
    
    return p

# Softmax cost layer
def softmax_cost(x, w, target_idx):
    """
    Calculate softmax cost and gradient
    No bias in this implementation
    
    Inputs
        - x            # vector from previous layer, such as hidden layer
        - w            # weight for last layer, so called "output vectors"
        - target_idx   # Target word index, ground truth
    
    Outputs
        - cost         # Softmax + cross-entropy cost for the current prediction
        - dw           # Gradient with respect to weights
        - dx           # Gradient with respect to input vector
    """
    V, D = w.shape   # V: vocabulary size, D: Vector dimension
    x = x.reshape((1, D))
    # x.shape is (1, D)
    prob = softmax(x.dot(w.T))    # (1, V)
    cost = -np.log(prob[0, target_idx])  # Cross-entropy loss; ground-truth distribution is one-hot
    dcost = prob                  # (1, V)
    dcost[0, target_idx] -= 1
    dw = dcost.T.dot(x)           # (V,1) * (1, D) ==> (V, D)
    dx = dcost.dot(w)             # (1,V) * (V, D) ==> (1, D)
    
    return cost, dw, dx.flatten()

# Naive negative sample generator to randomly pick negative samples
# The paper recommends sampling with probability proportional to unigram frequency^(3/4); here we just use a uniform distribution
def get_random_samples(target_idx, data_size, sample_size):
    """
    Randomly pick sample_size of samples for negative sampling
    """
    indices = [None] * sample_size
    for i in xrange(sample_size):
        new_idx = random.randint(0, data_size - 1)
        while new_idx == target_idx:
            new_idx = random.randint(0, data_size - 1)
        indices[i] = new_idx
    return indices
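
# Hedged alternative sketch (not used below): draw negatives from the unigram
# distribution raised to the 3/4 power, as the word2vec paper recommends.
# `word_counts` is an assumed array of raw word frequencies with shape (V,).
def get_unigram_samples(target_idx, word_counts, sample_size):
    probs = np.power(word_counts, 0.75)
    probs = probs / np.sum(probs)
    samples = []
    while len(samples) < sample_size:
        new_idx = np.random.choice(len(word_counts), p=probs)
        if new_idx != target_idx:
            samples.append(new_idx)
    return samples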
        

# Negative sampling cost layer
def negative_sampling_cost(x, w, target_idx, sample_size=10):
    """
    Purpose is to replace softmax cost layer to handle large vocabulary size
    Using randomly picked samples to adjust prediction and provide feedback
    
    Inputs
        - x            # vector from previous layer, such as hidden layer
        - w            # weight for last layer, so called "output vectors"
        - target_idx   # Target word index, ground truth
        - sample_size  # Size for negative sampling
    
    Outputs
        - cost         # Negative-sampling cost for the current prediction
        - dw           # Gradient with respect to weights
        - dx           # Gradient with respect to input vector
    
    """
    V, D = w.shape
    dw = np.zeros_like(w)
    dx = np.zeros_like(x)
    indices = [target_idx]
    indices.extend(get_random_samples(target_idx, V, sample_size))
    #print "indices = {}".format(indices)
    labels = np.array([1] + [-1 for i in range(sample_size)])   # (sample_size + 1,)
    wvecs = w[indices]                                           # (sample_size+1, D)
    z = np.dot(wvecs, x.T)    # (sample_size + 1, D) dot (D) ==> (sample_size + 1)
    z = z * labels
    scores, _ = logistic(z) # (sample_size + 1)
    cost = - np.sum(np.log(scores))   # A scalar
    # Calculate gradients
    dz = labels * (scores -1)   # (sample_size + 1)
    dz = dz.reshape((1, sample_size + 1))
    dx = dz.dot(wvecs)    # (sample_size + 1) dot (sample_size + 1, D) ==> (1, D)
    dwvecs = dz.T.dot(x.reshape((1, D)))  # (sample_size + 1, 1) dot (1, D) ==> (sample_size + 1, D)
    
    # Copy gradient for sample vectors to complete dW array
    for i in range(sample_size + 1):
        dw[indices[i]] += dwvecs[i,:]
    
    return cost, dw, dx.flatten()


# skipgram function
def skipgram(center_word, context_size, 
             context_words, input_vectors, output_vectors, 
             cost_func=negative_sampling_cost):
    """
    Skip-Gram model
     - Inputs:
        center_word: index of current center word
        context_size: size of context window
        context_words: list of word indices of context words
        input_vectors: word vectors from input to hidden layer, for lookup
        output_vectors: word vectors from hidden layer to final cost function
        cost_func: cost function, such as softmax_cost or negative_sampling_cost
     - Outputs:
        cost: evaluated cost value
        grad_in: gradient with regard to input_vectors
        grad_out: gradient with regard to output_vectors
    """
    cost = 0
    grad_in = np.zeros_like(input_vectors)
    grad_out = np.zeros_like(output_vectors)
    
    center_word_vec = input_vectors[center_word, :]
    
    for target_word_idx in context_words:
        c_cost, c_dw, c_dx = cost_func(center_word_vec, output_vectors, target_word_idx)
        cost += c_cost
        grad_in[center_word, :] += c_dx
        grad_out += c_dw
        
    return cost, grad_in, grad_out
        
# continuous bag of words
def cbow(center_word, context_size, 
             context_words, input_vectors, output_vectors, 
             cost_func=negative_sampling_cost):
    """
    CBOW model
     - Inputs:
        center_word: index of current center word
        context_size: size of context window
        context_words: list of word indices of context words
        input_vectors: word vectors from input to hidden layer, for lookup
        output_vectors: word vectors from hidden layer to final cost function
        cost_func: cost function, such as softmax_cost or negative_sampling_cost
     - Outputs:
        cost: evaluated cost value
        grad_in: gradient with regard to input_vectors
        grad_out: gradient with regard to output_vectors
        
    """
    cost = 0
    grad_in = np.zeros_like(input_vectors)
    grad_out = np.zeros_like(output_vectors)
    V, D = input_vectors.shape
    
    onehot = np.zeros((2*context_size, V))
    target_word_idx = center_word
    #print context_size, context_words
    for i, w_idx in enumerate(context_words):
        onehot[i, w_idx] += 1
    
    d = np.dot(onehot, input_vectors) # (2C, V) dot (V, D) ==> (2C, D)
    cbow_vec = np.sum(d, axis=0)/(2 * context_size)  # Average vectors' weight, shape (1, D)
    cost, grad_out, grad_cbow_vec = cost_func(cbow_vec, output_vectors, target_word_idx)

    for word in context_words:
        grad_in[word] += grad_cbow_vec/(2 * context_size)
    
        
    return cost, grad_in, grad_out
In [3]:
# Check gradient   
def gradcheck_naive(f, x):
    """ 
    Gradient check for a function f 
    - f should be a function that takes a single argument and outputs the cost and its gradients
    - x is the point (numpy array) to check the gradient at
    """ 

    rndstate = random.getstate()
    random.setstate(rndstate)  
    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-8

    # Iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        org_fx = fx
        org_x = x[ix]
        x[ix] = org_x + 0.5*h
        random.setstate(rndstate)
        fx_a, _ = f(x)
        x[ix] = org_x - 0.5*h
        random.setstate(rndstate)
        fx_b, _ = f(x)
        fx = org_fx
        x[ix] = org_x
        numgrad = (fx_a - fx_b) / h

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print "Gradient check failed."
            print "First gradient error found at index %s" % str(ix)
            print "Your gradient: %f \t Numerical gradient: %f" % (grad[ix], numgrad)
            return
    
        it.iternext() # Step 
    print "Gradient check passed!"


    
def getRandomContext(context_size,
                     dataset):
    span = len(dataset)
    centerword_idx = random.randint(0, span-1)
    context = dataset[max(0, centerword_idx - context_size): centerword_idx]
    if centerword_idx + 1 < span:
        context += dataset[centerword_idx + 1: min(span, centerword_idx + context_size + 1)]
    #print "centerword = {}, context = {}".format(dataset[centerword_idx], context)
    return dataset[centerword_idx], [dataset[w] for w in context]

def word2vec_sgd_batch(model, word_vectors, dataset, context_size, word2vec_cost_func=negative_sampling_cost):
    """ Run word2vec stochastic gradient descent for a set (batchsize) of randomly picked data"""
    batchsize = 50
    cost = 0.0
    grad = np.zeros(word_vectors.shape)
    N = word_vectors.shape[0]
    input_vectors = word_vectors[:N/2, :]
    output_vectors = word_vectors[N/2:,:]
    for i in xrange(batchsize):
        c_size = random.randint(1, context_size)
        centerword, context = getRandomContext(c_size, dataset)
        c, gin, gout = model(centerword, c_size, context, input_vectors, output_vectors, word2vec_cost_func)
        cost += c/ batchsize
        grad[:N/2, :] += gin / batchsize
        grad[N/2:, :] += gout /batchsize
        
    return cost, grad


### Check gradients with a dummy dataset
#dummy_dataset = ['abacus', 'b', 'c', 'd', 'e']
dummy_dataset = [0, 1, 2, 3, 4]
dummy_dict = dict([("abacus", 0), ("b", 1), ("c", 2), ("d", 3), ("e", 4)])

# Fix random seeds for numerical gradient check
random.seed(123456)
np.random.seed(654321)
dummy_vectors = normalize_vecs(np.random.randn(10, 3))

# Run Gradient check
gradcheck_naive(lambda vec: word2vec_sgd_batch(skipgram, vec, dummy_dataset, 5), dummy_vectors)
gradcheck_naive(lambda vec: word2vec_sgd_batch(skipgram, vec, dummy_dataset, 5, softmax_cost), dummy_vectors)
gradcheck_naive(lambda vec: word2vec_sgd_batch(cbow, vec, dummy_dataset, 5), dummy_vectors)
gradcheck_naive(lambda vec: word2vec_sgd_batch(cbow, vec, dummy_dataset, 5, softmax_cost), dummy_vectors)
Gradient check passed!
Gradient check passed!
Gradient check passed!
Gradient check passed!
In [4]:
# Time to get a real dataset to play with!
# We will use the 'text8' dataset, a cleaned, text-only excerpt of Wikipedia, trimmed to the first 10^8 bytes

dataset_url = 'http://mattmahoney.net/dc/text8.zip'
expected_bytes = 31344016

# TODO: add automatic retry with delay on network connection errors
def may_need_download(url_link, dest_directory, expected_bytes):
    """
    Download files if not available locally
    Args:
         url_link: the target file for downloading
         dest_directory: local directory to save downloaded file
         expected_bytes: expected file size, used to make sure downloading is correct
    Return:
         filepath
    """
    if not os.path.exists(dest_directory):
        os.makedirs(dest_directory)
    filename = url_link.split('/')[-1]
    filepath = os.path.join(dest_directory, filename)
    
    if not os.path.exists(filepath):
        def _progress(count, block_size, total_size):
            sys.stdout.write('\r>>> Downloading %s %.1f%%' 
                             %(filename, float(count * block_size) / float(total_size) * 100.0))
            sys.stdout.flush()
        filepath, _ = urllib.request.urlretrieve(url_link, filepath, _progress)
        print '\n'
        
    statinfo = os.stat(filepath)
    if statinfo.st_size == expected_bytes:
        print 'Successfully downloaded', filename, statinfo.st_size, 'bytes.'
    else:
        raise Exception('File ' + filepath + ' is damaged. Please try to download it again')
    return filepath

dataset_file = may_need_download(dataset_url, 'dataset', expected_bytes)
Successfully downloaded text8.zip 31344016 bytes.
In [5]:
# Read data into a list of strings
def read_data(filename):
    """
    Extract the file, it is a huge list of strings
    """
    with zipfile.ZipFile(filename) as f:
        data = f.read(f.namelist()[0]).split()
    return data

words_list = read_data(dataset_file)
print "Dataset type %s, with %d words" % (type(words_list), len(words_list))
Dataset type <type 'list'>, with 17005207 words
In [6]:
# Process data
def build_dataset(words_list, vocabulary_size):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words_list).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words_list:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0 # map to 'UNK'
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

### Get Vocabulary ready
vocabulary_size = 20000
dataset, count, dictionary, r_dictionary = build_dataset(words_list, vocabulary_size)
del words_list
In [7]:
# Check some samples of the dataset 
print 'Most common words:', count[:5]
print 'Sample data:', dataset[:10], [r_dictionary[i] for i in dataset[:10]]
Most common words: [['UNK', 996665], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data: [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156] ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']
In [8]:
# Add parameter saving and restoring capability

def save_params(iter, params, cost_history, target_dir):
    """ Function to save paramerters for later to restore. """
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
        
    filepath = os.path.join(target_dir, "saved_params_%d.npy" % iter)    
    with open(filepath, "wb") as f:
        pickle.dump(params, f)
        pickle.dump(cost_history, f)
        pickle.dump(random.getstate(), f)
        
def load_saved_params(target_dir):
    """ Restore saved parameters and set iteration number. """
    start_point = 0
    for f in glob.glob("{}/saved_params_*.npy".format(target_dir)):
        iter = int(os.path.splitext(os.path.basename(f))[0].split("_")[-1])
        if iter > start_point :
            start_point = iter
    if start_point > 0:
        with open("{}/saved_params_{}.npy".format(target_dir, start_point), "r") as f:
            params = pickle.load(f)
            cost_history = pickle.load(f)
            state = pickle.load(f)
        return start_point, params, cost_history, state
    else:
        return start_point, None, [], None
    
    

def sgd_optimizer(func, x0, step, iterations, 
                  PRINT_EVERY=100,
                  SAVE_PARAMS_EVERY=2000,
                  use_saved=False, 
                  save_dir='saved_params_dir'):
    """ 
    Stochastic Gradient Descent Optimizer             
    
     Inputs:                                                         
     - func: the function to optimize, it should take a single        
         argument and yield two outputs, a cost and the gradient  
         with respect to the arguments                            
     - x0: the initial point to start SGD from                     
     - step: the step size for SGD                                 
     - iterations: total iterations to run SGD with
     - save_dir: directory for saving parameters
 

     Output:                                                         
     - x: the parameter value after SGD finishes  
     - cost_history: the history of cost function value
    """
    
    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 5000
    x = x0
    start_iter = 0
    cost_history = []
    if use_saved:
        start_iter, old_x, cost_history, state = load_saved_params(save_dir)
        if start_iter > 0:
            x = old_x
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)
        if state:
            random.setstate(state)
    
    
    expcost = None
    
    
    for iter in xrange(start_iter + 1, iterations + 1):
        cost = None
        cost, grad = func(x)
        x -= step * grad
        cost_history.append(cost)
        
        if iter % PRINT_EVERY == 0:
            if not expcost:
                expcost = cost
            else:
                expcost = .95 * expcost + .05 * cost
            print "\r>>> iter %d: %f" % (iter, expcost)
        
        if iter % SAVE_PARAMS_EVERY == 0 and use_saved:
            save_params(iter, x, cost_history, save_dir)
            
        if iter % ANNEAL_EVERY == 0:
            step *= 0.5
    print "\n"
    return x, cost_history
In [9]:
embedding_size = 64
word_vectors = normalize_vecs(np.random.randn(2*vocabulary_size, embedding_size))
context_window_size = 1

trained_word_vector, cost_history = sgd_optimizer(lambda vec: 
                                        word2vec_sgd_batch(skipgram, 
                                                             vec, dataset, 
                                                             context_window_size),
                                        word_vectors, 
                                        1, 30000, use_saved=True)

#print len(dataset)
>>> iter 100: 15.142714
>>> iter 200: 15.127362
>>> iter 300: 15.097164
>>> iter 400: 15.010163
>>> iter 500: 14.846600
>>> iter 600: 14.618834
>>> iter 700: 14.298757
>>> iter 800: 14.051741
>>> iter 900: 13.763645
>>> iter 1000: 13.466327
>>> iter 1100: 13.084013
>>> iter 1200: 12.830128
>>> iter 1300: 12.493584
>>> iter 1400: 12.276524
>>> iter 1500: 12.077296
>>> iter 1600: 11.762810
>>> iter 1700: 11.465948
>>> iter 1800: 11.179325
>>> iter 1900: 10.937557
>>> iter 2000: 10.731226
>>> iter 2100: 10.509891
>>> iter 2200: 10.285919
>>> iter 2300: 10.084203
>>> iter 2400: 9.869337
>>> iter 2500: 9.586403
>>> iter 2600: 9.439124
>>> iter 2700: 9.284449
>>> iter 2800: 9.117748
>>> iter 2900: 8.927890
>>> iter 3000: 8.750479
>>> iter 3100: 8.567016
>>> iter 3200: 8.372594
>>> iter 3300: 8.173714
>>> iter 3400: 8.059736
>>> iter 3500: 7.852919
>>> iter 3600: 7.679299
>>> iter 3700: 7.461978
>>> iter 3800: 7.326921
>>> iter 3900: 7.174543
>>> iter 4000: 6.989613
>>> iter 4100: 6.842037
>>> iter 4200: 6.695531
>>> iter 4300: 6.558940
>>> iter 4400: 6.450024
>>> iter 4500: 6.361125
>>> iter 4600: 6.226748
>>> iter 4700: 6.113042
>>> iter 4800: 6.014114
>>> iter 4900: 5.919818
>>> iter 5000: 5.809129
>>> iter 5100: 5.711771
>>> iter 5200: 5.631438
>>> iter 5300: 5.514884
>>> iter 5400: 5.434048
>>> iter 5500: 5.324360
>>> iter 5600: 5.249820
>>> iter 5700: 5.220132
>>> iter 5800: 5.150067
>>> iter 5900: 5.041397
>>> iter 6000: 4.958067
>>> iter 6100: 4.882629
>>> iter 6200: 4.823477
>>> iter 6300: 4.719328
>>> iter 6400: 4.703851
>>> iter 6500: 4.668334
>>> iter 6600: 4.565740
>>> iter 6700: 4.531533
>>> iter 6800: 4.527520
>>> iter 6900: 4.477774
>>> iter 7000: 4.393866
>>> iter 7100: 4.349869
>>> iter 7200: 4.287739
>>> iter 7300: 4.188506
>>> iter 7400: 4.124938
>>> iter 7500: 4.062781
>>> iter 7600: 4.038402
>>> iter 7700: 3.969521
>>> iter 7800: 3.971548
>>> iter 7900: 3.921790
>>> iter 8000: 3.929343
>>> iter 8100: 3.884722
>>> iter 8200: 3.806265
>>> iter 8300: 3.757294
>>> iter 8400: 3.756317
>>> iter 8500: 3.716506
>>> iter 8600: 3.676878
>>> iter 8700: 3.657006
>>> iter 8800: 3.658382
>>> iter 8900: 3.606045
>>> iter 9000: 3.595908
>>> iter 9100: 3.548167
>>> iter 9200: 3.557864
>>> iter 9300: 3.488735
>>> iter 9400: 3.460848
>>> iter 9500: 3.429942
>>> iter 9600: 3.412726
>>> iter 9700: 3.378585
>>> iter 9800: 3.358519
>>> iter 9900: 3.343109
>>> iter 10000: 3.363491
>>> iter 10100: 3.363901
>>> iter 10200: 3.320365
>>> iter 10300: 3.284050
>>> iter 10400: 3.258947
>>> iter 10500: 3.226283
>>> iter 10600: 3.205318
>>> iter 10700: 3.199333
>>> iter 10800: 3.175485
>>> iter 10900: 3.174435
>>> iter 11000: 3.121339
>>> iter 11100: 3.102123
>>> iter 11200: 3.087775
>>> iter 11300: 3.073892
>>> iter 11400: 3.034521
>>> iter 11500: 3.016586
>>> iter 11600: 3.014720
>>> iter 11700: 2.994715
>>> iter 11800: 2.958088
>>> iter 11900: 2.979039
>>> iter 12000: 2.962898
>>> iter 12100: 2.922493
>>> iter 12200: 2.932794
>>> iter 12300: 2.892336
>>> iter 12400: 2.894966
>>> iter 12500: 2.881443
>>> iter 12600: 2.832328
>>> iter 12700: 2.856374
>>> iter 12800: 2.831711
>>> iter 12900: 2.863712
>>> iter 13000: 2.843840
>>> iter 13100: 2.847455
>>> iter 13200: 2.813263
>>> iter 13300: 2.817830
>>> iter 13400: 2.809909
>>> iter 13500: 2.809166
>>> iter 13600: 2.802369
>>> iter 13700: 2.792205
>>> iter 13800: 2.813283
>>> iter 13900: 2.813931
>>> iter 14000: 2.806737
>>> iter 14100: 2.804330
>>> iter 14200: 2.788540
>>> iter 14300: 2.807760
>>> iter 14400: 2.775908
>>> iter 14500: 2.740234
>>> iter 14600: 2.747506
>>> iter 14700: 2.740294
>>> iter 14800: 2.684119
>>> iter 14900: 2.649346
>>> iter 15000: 2.648224
>>> iter 15100: 2.654741
>>> iter 15200: 2.644998
>>> iter 15300: 2.633809
>>> iter 15400: 2.641030
>>> iter 15500: 2.660329
>>> iter 15600: 2.662161
>>> iter 15700: 2.633361
>>> iter 15800: 2.659201
>>> iter 15900: 2.651981
>>> iter 16000: 2.610024
>>> iter 16100: 2.602103
>>> iter 16200: 2.581829
>>> iter 16300: 2.550988
>>> iter 16400: 2.517804
>>> iter 16500: 2.512838
>>> iter 16600: 2.508392
>>> iter 16700: 2.520389
>>> iter 16800: 2.507302
>>> iter 16900: 2.523870
>>> iter 17000: 2.499940
>>> iter 17100: 2.513977
>>> iter 17200: 2.504930
>>> iter 17300: 2.518843
>>> iter 17400: 2.514356
>>> iter 17500: 2.516082
>>> iter 17600: 2.522284
>>> iter 17700: 2.556433
>>> iter 17800: 2.544539
>>> iter 17900: 2.575273
>>> iter 18000: 2.555194
>>> iter 18100: 2.609805
>>> iter 18200: 2.623953
>>> iter 18300: 2.615340
>>> iter 18400: 2.591577
>>> iter 18500: 2.610378
>>> iter 18600: 2.595383
>>> iter 18700: 2.581249
>>> iter 18800: 2.559557
>>> iter 18900: 2.552271
>>> iter 19000: 2.544696
>>> iter 19100: 2.586441
>>> iter 19200: 2.599598
>>> iter 19300: 2.603714
>>> iter 19400: 2.603543
>>> iter 19500: 2.578862
>>> iter 19600: 2.589547
>>> iter 19700: 2.556354
>>> iter 19800: 2.555944
>>> iter 19900: 2.527344
>>> iter 20000: 2.515879
>>> iter 20100: 2.503299
>>> iter 20200: 2.490162
>>> iter 20300: 2.501330
>>> iter 20400: 2.488866
>>> iter 20500: 2.463605
>>> iter 20600: 2.458159
>>> iter 20700: 2.433686
>>> iter 20800: 2.423189
>>> iter 20900: 2.432274
>>> iter 21000: 2.434593
>>> iter 21100: 2.425869
>>> iter 21200: 2.427357
>>> iter 21300: 2.441660
>>> iter 21400: 2.409635
>>> iter 21500: 2.403614
>>> iter 21600: 2.433992
>>> iter 21700: 2.444090
>>> iter 21800: 2.462798
>>> iter 21900: 2.462520
>>> iter 22000: 2.462924
>>> iter 22100: 2.454772
>>> iter 22200: 2.487916
>>> iter 22300: 2.488028
>>> iter 22400: 2.495287
>>> iter 22500: 2.468934
>>> iter 22600: 2.457199
>>> iter 22700: 2.455810
>>> iter 22800: 2.437125
>>> iter 22900: 2.458581
>>> iter 23000: 2.493252
>>> iter 23100: 2.471698
>>> iter 23200: 2.490359
>>> iter 23300: 2.496645
>>> iter 23400: 2.507487
>>> iter 23500: 2.534732
>>> iter 23600: 2.490386
>>> iter 23700: 2.501691
>>> iter 23800: 2.471074
>>> iter 23900: 2.465607
>>> iter 24000: 2.460707
>>> iter 24100: 2.481947
>>> iter 24200: 2.488133
>>> iter 24300: 2.451984
>>> iter 24400: 2.475954
>>> iter 24500: 2.493711
>>> iter 24600: 2.462351
>>> iter 24700: 2.482828
>>> iter 24800: 2.492132
>>> iter 24900: 2.497655
>>> iter 25000: 2.476303
>>> iter 25100: 2.492079
>>> iter 25200: 2.475966
>>> iter 25300: 2.467218
>>> iter 25400: 2.450641
>>> iter 25500: 2.433377
>>> iter 25600: 2.422653
>>> iter 25700: 2.403179
>>> iter 25800: 2.392250
>>> iter 25900: 2.405882
>>> iter 26000: 2.377624
>>> iter 26100: 2.364691
>>> iter 26200: 2.355837
>>> iter 26300: 2.345257
>>> iter 26400: 2.345679
>>> iter 26500: 2.342889
>>> iter 26600: 2.367379
>>> iter 26700: 2.393426
>>> iter 26800: 2.444213
>>> iter 26900: 2.417209
>>> iter 27000: 2.438486
>>> iter 27100: 2.438410
>>> iter 27200: 2.462644
>>> iter 27300: 2.432779
>>> iter 27400: 2.425466
>>> iter 27500: 2.421917
>>> iter 27600: 2.432899
>>> iter 27700: 2.424900
>>> iter 27800: 2.421964
>>> iter 27900: 2.441862
>>> iter 28000: 2.484662
>>> iter 28100: 2.496116
>>> iter 28200: 2.499512
>>> iter 28300: 2.496692
>>> iter 28400: 2.514777
>>> iter 28500: 2.488605
>>> iter 28600: 2.468500
>>> iter 28700: 2.435790
>>> iter 28800: 2.419856
>>> iter 28900: 2.380359
>>> iter 29000: 2.353039
>>> iter 29100: 2.344961
>>> iter 29200: 2.328637
>>> iter 29300: 2.313071
>>> iter 29400: 2.315946
>>> iter 29500: 2.296986
>>> iter 29600: 2.291862
>>> iter 29700: 2.323659
>>> iter 29800: 2.309146
>>> iter 29900: 2.293766
>>> iter 30000: 2.292499


In [10]:
# Plot training cost curve
plt.plot(cost_history, 'o')
plt.xlabel('Iteration')
plt.ylabel('Cost')
Out[10]:
<matplotlib.text.Text at 0x7f0764bfd4d0>
In [11]:
# Plot the results
final_embeddings = trained_word_vector[:vocabulary_size, :] + trained_word_vector[vocabulary_size:, :]

def plot_with_labels(low_dim_embs, labels, filename='tsne.png'):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i, :]
        plt.scatter(x, y)
        plt.annotate(label,
                    xy=(x,y),
                    xytext=(5, 2),
                    textcoords='offset points',
                    ha='right',
                    va='bottom')
    plt.savefig(filename)
        

tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
plot_only = 500
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])
labels = [r_dictionary[i] for i in xrange(plot_only)]
plot_with_labels(low_dim_embs, labels)
        
In [12]:
# Hand pick several words to show the distances
visualizeWords = ["a", "an", "the", "one", "first", "two", "three", "four",
                  "good", "great", "well", "poor", "bad", "evil", "god"]

visualizeIdx = [dictionary[word] for word in visualizeWords]
visualizeVecs = final_embeddings[visualizeIdx, :]
temp = (visualizeVecs - np.mean(visualizeVecs, axis=0))
covariance = 1.0 / len(visualizeIdx) * temp.T.dot(temp)
U,S,V = np.linalg.svd(covariance)
coord = temp.dot(U[:,0:2]) 

for i in xrange(len(visualizeWords)):
    plt.text(coord[i,0], coord[i,1], visualizeWords[i], 
    	bbox=dict(facecolor='green', alpha=0.1))
    
plt.xlim((np.min(coord[:,0]), np.max(coord[:,0])))
plt.ylim((np.min(coord[:,1]), np.max(coord[:,1])))

plt.savefig('q3_word_vectors.png')
plt.show()
In [13]:
# Check similarity of embeddings for the most frequent words
nom = np.sqrt(np.sum(np.square(final_embeddings), 1, keepdims=True))
normalized_embeddings = final_embeddings / nom

# Randomly pick some words from among the most frequent ones
valid_size = 16
valid_example = np.random.choice(50, valid_size, replace=False)
valid_matrix = normalized_embeddings[valid_example,:]
similarity = valid_matrix.dot(normalized_embeddings.T)

for i in xrange(valid_size):
    valid_word = r_dictionary[valid_example[i]]
    top_k = 8 # number of nearest neighbors
    nearest = (similarity[i,:]).argsort()[-top_k-1:-1]
    close_words = [r_dictionary[k] for k in nearest]
    print "'{}' are closest words to '{}'".format(', '.join(close_words), valid_word)

    
'this, from, state, international, at, some, and, an' are closest words to 'in'
'zero, one, five, seven, four, eight, six, three' are closest words to 'two'
'eight, two, four, five, nine, zero, three, seven' are closest words to 'six'
'only, some, would, they, known, an, it, he' are closest words to 'be'
'sometimes, being, published, but, united, became, while, called' are closest words to 'also'
'who, such, which, from, their, his, or, this' are closest words to 's'
'both, its, in, some, would, an, international, this' are closest words to 'and'
'more, their, where, if, not, at, are, was' are closest words to 'were'
'act, however, place, there, english, they, any, what' are closest words to 'UNK'
'in, or, such, and, this, into, from, was' are closest words to 'by'
'from, its, will, states, all, this, but, his' are closest words to 'their'
'their, an, its, not, some, when, they, be' are closest words to 'he'
'both, united, some, there, often, he, not, its' are closest words to 'they'
'if, new, during, an, when, other, from, this' are closest words to 'or'
'many, after, not, this, can, some, were, but' are closest words to 'at'
'which, its, not, some, in, it, and, this' are closest words to 'an'