
Session 4: Visualizing And Hallucinating Representations

Creative Applications of Deep Learning with TensorFlow

This session works with state-of-the-art networks and sees how to understand what "representations" they learn. We'll see how this process actually...

Session 4: Visualizing Representations

Learning Goals

  • Learn how to inspect deep networks by visualizing their gradients
  • Learn how to "deep dream" with different objective functions and regularization techniques
  • Learn how to "stylize" an image using content and style losses from different images

Table of Contents

  • Introduction
  • Deep Convolutional Networks
    • Loading a Pretrained Network
    • Predicting with the Inception Network
    • Visualizing Filters
    • Visualizing the Gradient
  • Deep Dreaming
    • Simplest Approach
    • Specifying the Objective
    • Decaying the Gradient
    • Blurring the Gradient
    • Clipping the Gradient
    • Infinite Zoom / Fractal
    • Laplacian Pyramid
  • Style Net
    • VGG Network
    • Dropout
    • Defining the Content Features
    • Defining the Style Features
    • Remapping the Input
    • Defining the Content Loss
    • Defining the Style Loss
    • Defining the Total Variation Loss
    • Training
  • Homework
  • Reading

Introduction

So far, we've seen that a deep convolutional network can get very high accuracy in classifying the MNIST dataset, a dataset of handwritten digits numbered 0 - 9. What happens when the number of classes grows higher than 10 possibilities? Or the images get much larger? We're going to explore a few new datasets and bigger and better models to try and find out. We'll then explore a few interesting visualization techniques to help us understand what the networks are representing in their deeper layers, and how these techniques can be used for some very interesting creative applications.

Deep Convolutional Networks

Almost 30 years of computer vision and machine learning research on images has taken an approach like what we saw at the end of Session 1: you take an image, convolve it with a set of edge detectors like the Gabor filter we created, and then threshold the result to find more interesting features, such as corners, or look at histograms of the orientations of edges in a particular window. In the previous session, we started to see how Deep Learning has allowed us to move away from hand-crafted features such as Gabor-like filters and instead let the data discover representations. But how well does it scale?

A seminal shift in the perceived capabilities of deep neural networks occurred in 2012. A network dubbed AlexNet, after its primary author, Alex Krizhevsky, achieved remarkable performance on one of the most difficult computer vision datasets at the time, ImageNet. <TODO: Insert montage of ImageNet>. ImageNet is a dataset used in a yearly challenge called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), started in 2010. The dataset contains nearly 1.2 million images composed of 1000 different types of objects. Each object has anywhere between 600 - 1200 different images. <TODO: Histogram of object labels>

Up until now, the largest number of labels we've considered is 10! The images were also very small, only 28 x 28 pixels, and they didn't even have color.

Let's look at a state-of-the-art network that has already been trained on ImageNet.

Loading a Pretrained Network

We can use an existing network that has been trained by loading the model's weights into a network definition. The network definition describes the set of operations in the TensorFlow graph: how the image is manipulated and filtered in order to get from an input image to a probability saying which 1 of 1000 possible objects the image is describing. Loading the model also restores its weights, which are the values of every parameter in the network learned through gradient descent. Luckily, many researchers are releasing their model definitions and weights so we don't have to train them! We just have to load them up and then we can use the model straight away. That's very lucky for us because these models take a lot of time, CPU, memory, and money to train.
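Under the hood, loading a pretrained model like this usually amounts to parsing a serialized graph definition (a protobuf) and importing it into the current graph, with the trained weights baked in as constants. Here is a minimal sketch of that general pattern, assuming a hypothetical file 'inception.pb'; the course's inception helper used below wraps these steps for us:

import tensorflow as tf

# Hypothetical path to a serialized, "frozen" model in which the trained
# weights have been baked into the graph definition as constants.
model_path = 'inception.pb'

# Parse the protobuf into a GraphDef, then import its operations
# into the current default graph under a name scope.
graph_def = tf.GraphDef()
with tf.gfile.GFile(model_path, 'rb') as f:
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name='inception')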

To get the files required for these models, you'll need to download them from the resources page.

First, let's import some necessary libraries.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import IPython.display as ipyd
from libs import gif, nb_utils
In [2]:
# Bit of formatting because I don't like the default inline code style:
from IPython.core.display import HTML
HTML("""<style> .rendered_html code { 
    padding: 2px 4px;
    color: #c7254e;
    background-color: #f9f2f4;
    border-radius: 4px;
} </style>""")
Out[2]:

Start an interactive session:

In [3]:
sess = tf.InteractiveSession()

Now we'll load Google's Inception model, which is a pretrained network for classification built using the ImageNet database. I've included some helper functions for getting this model loaded and set up with TensorFlow.

In [4]:
from libs import inception
net = inception.get_inception_model()

Here's a little extra that wasn't in the lecture. We can visualize the graph definition using the nb_utils module's show_graph function. This function is taken from an example in the TensorFlow repo, so I can't take credit for it! It uses TensorBoard, TensorFlow's web interface for visualizing graphs and training performance, which we sadly didn't have enough time to discuss.

In [5]:
nb_utils.show_graph(net['graph_def'])

We'll now get the graph from the storage container, and tell TensorFlow to use this as its own graph. This will add all the computations we need to compute the entire deep net, as well as all of the pre-trained parameters.

In [6]:
tf.import_graph_def(net['graph_def'], name='inception')
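Since the Inception graph is now part of the default graph, you could also write it out for the TensorBoard interface mentioned above. This is a minimal sketch, assuming the TensorFlow 1.x tf.summary.FileWriter API; after running it, launch `tensorboard --logdir=logs` in a terminal to browse the graph.

# Hypothetical: export the current default graph so TensorBoard can display it.
writer = tf.summary.FileWriter('logs', tf.get_default_graph())
writer.close()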
In [7]:
net['labels']
Out[7]:
[(0, 'dummy'),
 (1, 'kit fox'),
 (2, 'English setter'),
 (3, 'Siberian husky'),
 ...]

<TODO: visual of graph>

Let's have a look at the graph:

In [8]:
g = tf.get_default_graph()
names = [op.name for op in g.get_operations()]
print(names)
['inception/input', 'inception/conv2d0_w', 'inception/conv2d0_b', 'inception/conv2d1_w', 'inception/conv2d1_b', 'inception/conv2d2_w', 'inception/conv2d2_b', 'inception/mixed3a_1x1_w', 'inception/mixed3a_1x1_b', 'inception/mixed3a_3x3_bottleneck_w', 'inception/mixed3a_3x3_bottleneck_b', 'inception/mixed3a_3x3_w', 'inception/mixed3a_3x3_b', 'inception/mixed3a_5x5_bottleneck_w', 'inception/mixed3a_5x5_bottleneck_b', 'inception/mixed3a_5x5_w', 'inception/mixed3a_5x5_b', 'inception/mixed3a_pool_reduce_w', 'inception/mixed3a_pool_reduce_b', 'inception/mixed3b_1x1_w', 'inception/mixed3b_1x1_b', 'inception/mixed3b_3x3_bottleneck_w', 'inception/mixed3b_3x3_bottleneck_b', 'inception/mixed3b_3x3_w', 'inception/mixed3b_3x3_b', 'inception/mixed3b_5x5_bottleneck_w', 'inception/mixed3b_5x5_bottleneck_b', 'inception/mixed3b_5x5_w', 'inception/mixed3b_5x5_b', 'inception/mixed3b_pool_reduce_w', 'inception/mixed3b_pool_reduce_b', 'inception/mixed4a_1x1_w', 'inception/mixed4a_1x1_b', 'inception/mixed4a_3x3_bottleneck_w', 'inception/mixed4a_3x3_bottleneck_b', 'inception/mixed4a_3x3_w', 'inception/mixed4a_3x3_b', 'inception/mixed4a_5x5_bottleneck_w', 'inception/mixed4a_5x5_bottleneck_b', 'inception/mixed4a_5x5_w', 'inception/mixed4a_5x5_b', 'inception/mixed4a_pool_reduce_w', 'inception/mixed4a_pool_reduce_b', 'inception/mixed4b_1x1_w', 'inception/mixed4b_1x1_b', 'inception/mixed4b_3x3_bottleneck_w', 'inception/mixed4b_3x3_bottleneck_b', 'inception/mixed4b_3x3_w', 'inception/mixed4b_3x3_b', 'inception/mixed4b_5x5_bottleneck_w', 'inception/mixed4b_5x5_bottleneck_b', 'inception/mixed4b_5x5_w', 'inception/mixed4b_5x5_b', 'inception/mixed4b_pool_reduce_w', 'inception/mixed4b_pool_reduce_b', 'inception/mixed4c_1x1_w', 'inception/mixed4c_1x1_b', 'inception/mixed4c_3x3_bottleneck_w', 'inception/mixed4c_3x3_bottleneck_b', 'inception/mixed4c_3x3_w', 'inception/mixed4c_3x3_b', 'inception/mixed4c_5x5_bottleneck_w', 'inception/mixed4c_5x5_bottleneck_b', 'inception/mixed4c_5x5_w', 'inception/mixed4c_5x5_b', 'inception/mixed4c_pool_reduce_w', 'inception/mixed4c_pool_reduce_b', 'inception/mixed4d_1x1_w', 'inception/mixed4d_1x1_b', 'inception/mixed4d_3x3_bottleneck_w', 'inception/mixed4d_3x3_bottleneck_b', 'inception/mixed4d_3x3_w', 'inception/mixed4d_3x3_b', 'inception/mixed4d_5x5_bottleneck_w', 'inception/mixed4d_5x5_bottleneck_b', 'inception/mixed4d_5x5_w', 'inception/mixed4d_5x5_b', 'inception/mixed4d_pool_reduce_w', 'inception/mixed4d_pool_reduce_b', 'inception/mixed4e_1x1_w', 'inception/mixed4e_1x1_b', 'inception/mixed4e_3x3_bottleneck_w', 'inception/mixed4e_3x3_bottleneck_b', 'inception/mixed4e_3x3_w', 'inception/mixed4e_3x3_b', 'inception/mixed4e_5x5_bottleneck_w', 'inception/mixed4e_5x5_bottleneck_b', 'inception/mixed4e_5x5_w', 'inception/mixed4e_5x5_b', 'inception/mixed4e_pool_reduce_w', 'inception/mixed4e_pool_reduce_b', 'inception/mixed5a_1x1_w', 'inception/mixed5a_1x1_b', 'inception/mixed5a_3x3_bottleneck_w', 'inception/mixed5a_3x3_bottleneck_b', 'inception/mixed5a_3x3_w', 'inception/mixed5a_3x3_b', 'inception/mixed5a_5x5_bottleneck_w', 'inception/mixed5a_5x5_bottleneck_b', 'inception/mixed5a_5x5_w', 'inception/mixed5a_5x5_b', 'inception/mixed5a_pool_reduce_w', 'inception/mixed5a_pool_reduce_b', 'inception/mixed5b_1x1_w', 'inception/mixed5b_1x1_b', 'inception/mixed5b_3x3_bottleneck_w', 'inception/mixed5b_3x3_bottleneck_b', 'inception/mixed5b_3x3_w', 'inception/mixed5b_3x3_b', 'inception/mixed5b_5x5_bottleneck_w', 'inception/mixed5b_5x5_bottleneck_b', 'inception/mixed5b_5x5_w', 'inception/mixed5b_5x5_b', 
'inception/mixed5b_pool_reduce_w', 'inception/mixed5b_pool_reduce_b', 'inception/head0_bottleneck_w', 'inception/head0_bottleneck_b', 'inception/nn0_w', 'inception/nn0_b', 'inception/softmax0_w', 'inception/softmax0_b', 'inception/head1_bottleneck_w', 'inception/head1_bottleneck_b', 'inception/nn1_w', 'inception/nn1_b', 'inception/softmax1_w', 'inception/softmax1_b', 'inception/softmax2_w', 'inception/softmax2_b', 'inception/conv2d0_pre_relu/conv', 'inception/conv2d0_pre_relu', 'inception/conv2d0', 'inception/maxpool0', 'inception/localresponsenorm0', 'inception/conv2d1_pre_relu/conv', 'inception/conv2d1_pre_relu', 'inception/conv2d1', 'inception/conv2d2_pre_relu/conv', 'inception/conv2d2_pre_relu', 'inception/conv2d2', 'inception/localresponsenorm1', 'inception/maxpool1', 'inception/mixed3a_1x1_pre_relu/conv', 'inception/mixed3a_1x1_pre_relu', 'inception/mixed3a_1x1', 'inception/mixed3a_3x3_bottleneck_pre_relu/conv', 'inception/mixed3a_3x3_bottleneck_pre_relu', 'inception/mixed3a_3x3_bottleneck', 'inception/mixed3a_3x3_pre_relu/conv', 'inception/mixed3a_3x3_pre_relu', 'inception/mixed3a_3x3', 'inception/mixed3a_5x5_bottleneck_pre_relu/conv', 'inception/mixed3a_5x5_bottleneck_pre_relu', 'inception/mixed3a_5x5_bottleneck', 'inception/mixed3a_5x5_pre_relu/conv', 'inception/mixed3a_5x5_pre_relu', 'inception/mixed3a_5x5', 'inception/mixed3a_pool', 'inception/mixed3a_pool_reduce_pre_relu/conv', 'inception/mixed3a_pool_reduce_pre_relu', 'inception/mixed3a_pool_reduce', 'inception/mixed3a/concat_dim', 'inception/mixed3a', 'inception/mixed3b_1x1_pre_relu/conv', 'inception/mixed3b_1x1_pre_relu', 'inception/mixed3b_1x1', 'inception/mixed3b_3x3_bottleneck_pre_relu/conv', 'inception/mixed3b_3x3_bottleneck_pre_relu', 'inception/mixed3b_3x3_bottleneck', 'inception/mixed3b_3x3_pre_relu/conv', 'inception/mixed3b_3x3_pre_relu', 'inception/mixed3b_3x3', 'inception/mixed3b_5x5_bottleneck_pre_relu/conv', 'inception/mixed3b_5x5_bottleneck_pre_relu', 'inception/mixed3b_5x5_bottleneck', 'inception/mixed3b_5x5_pre_relu/conv', 'inception/mixed3b_5x5_pre_relu', 'inception/mixed3b_5x5', 'inception/mixed3b_pool', 'inception/mixed3b_pool_reduce_pre_relu/conv', 'inception/mixed3b_pool_reduce_pre_relu', 'inception/mixed3b_pool_reduce', 'inception/mixed3b/concat_dim', 'inception/mixed3b', 'inception/maxpool4', 'inception/mixed4a_1x1_pre_relu/conv', 'inception/mixed4a_1x1_pre_relu', 'inception/mixed4a_1x1', 'inception/mixed4a_3x3_bottleneck_pre_relu/conv', 'inception/mixed4a_3x3_bottleneck_pre_relu', 'inception/mixed4a_3x3_bottleneck', 'inception/mixed4a_3x3_pre_relu/conv', 'inception/mixed4a_3x3_pre_relu', 'inception/mixed4a_3x3', 'inception/mixed4a_5x5_bottleneck_pre_relu/conv', 'inception/mixed4a_5x5_bottleneck_pre_relu', 'inception/mixed4a_5x5_bottleneck', 'inception/mixed4a_5x5_pre_relu/conv', 'inception/mixed4a_5x5_pre_relu', 'inception/mixed4a_5x5', 'inception/mixed4a_pool', 'inception/mixed4a_pool_reduce_pre_relu/conv', 'inception/mixed4a_pool_reduce_pre_relu', 'inception/mixed4a_pool_reduce', 'inception/mixed4a/concat_dim', 'inception/mixed4a', 'inception/mixed4b_1x1_pre_relu/conv', 'inception/mixed4b_1x1_pre_relu', 'inception/mixed4b_1x1', 'inception/mixed4b_3x3_bottleneck_pre_relu/conv', 'inception/mixed4b_3x3_bottleneck_pre_relu', 'inception/mixed4b_3x3_bottleneck', 'inception/mixed4b_3x3_pre_relu/conv', 'inception/mixed4b_3x3_pre_relu', 'inception/mixed4b_3x3', 'inception/mixed4b_5x5_bottleneck_pre_relu/conv', 'inception/mixed4b_5x5_bottleneck_pre_relu', 'inception/mixed4b_5x5_bottleneck', 
'inception/mixed4b_5x5_pre_relu/conv', 'inception/mixed4b_5x5_pre_relu', 'inception/mixed4b_5x5', 'inception/mixed4b_pool', 'inception/mixed4b_pool_reduce_pre_relu/conv', 'inception/mixed4b_pool_reduce_pre_relu', 'inception/mixed4b_pool_reduce', 'inception/mixed4b/concat_dim', 'inception/mixed4b', 'inception/mixed4c_1x1_pre_relu/conv', 'inception/mixed4c_1x1_pre_relu', 'inception/mixed4c_1x1', 'inception/mixed4c_3x3_bottleneck_pre_relu/conv', 'inception/mixed4c_3x3_bottleneck_pre_relu', 'inception/mixed4c_3x3_bottleneck', 'inception/mixed4c_3x3_pre_relu/conv', 'inception/mixed4c_3x3_pre_relu', 'inception/mixed4c_3x3', 'inception/mixed4c_5x5_bottleneck_pre_relu/conv', 'inception/mixed4c_5x5_bottleneck_pre_relu', 'inception/mixed4c_5x5_bottleneck', 'inception/mixed4c_5x5_pre_relu/conv', 'inception/mixed4c_5x5_pre_relu', 'inception/mixed4c_5x5', 'inception/mixed4c_pool', 'inception/mixed4c_pool_reduce_pre_relu/conv', 'inception/mixed4c_pool_reduce_pre_relu', 'inception/mixed4c_pool_reduce', 'inception/mixed4c/concat_dim', 'inception/mixed4c', 'inception/mixed4d_1x1_pre_relu/conv', 'inception/mixed4d_1x1_pre_relu', 'inception/mixed4d_1x1', 'inception/mixed4d_3x3_bottleneck_pre_relu/conv', 'inception/mixed4d_3x3_bottleneck_pre_relu', 'inception/mixed4d_3x3_bottleneck', 'inception/mixed4d_3x3_pre_relu/conv', 'inception/mixed4d_3x3_pre_relu', 'inception/mixed4d_3x3', 'inception/mixed4d_5x5_bottleneck_pre_relu/conv', 'inception/mixed4d_5x5_bottleneck_pre_relu', 'inception/mixed4d_5x5_bottleneck', 'inception/mixed4d_5x5_pre_relu/conv', 'inception/mixed4d_5x5_pre_relu', 'inception/mixed4d_5x5', 'inception/mixed4d_pool', 'inception/mixed4d_pool_reduce_pre_relu/conv', 'inception/mixed4d_pool_reduce_pre_relu', 'inception/mixed4d_pool_reduce', 'inception/mixed4d/concat_dim', 'inception/mixed4d', 'inception/mixed4e_1x1_pre_relu/conv', 'inception/mixed4e_1x1_pre_relu', 'inception/mixed4e_1x1', 'inception/mixed4e_3x3_bottleneck_pre_relu/conv', 'inception/mixed4e_3x3_bottleneck_pre_relu', 'inception/mixed4e_3x3_bottleneck', 'inception/mixed4e_3x3_pre_relu/conv', 'inception/mixed4e_3x3_pre_relu', 'inception/mixed4e_3x3', 'inception/mixed4e_5x5_bottleneck_pre_relu/conv', 'inception/mixed4e_5x5_bottleneck_pre_relu', 'inception/mixed4e_5x5_bottleneck', 'inception/mixed4e_5x5_pre_relu/conv', 'inception/mixed4e_5x5_pre_relu', 'inception/mixed4e_5x5', 'inception/mixed4e_pool', 'inception/mixed4e_pool_reduce_pre_relu/conv', 'inception/mixed4e_pool_reduce_pre_relu', 'inception/mixed4e_pool_reduce', 'inception/mixed4e/concat_dim', 'inception/mixed4e', 'inception/maxpool10', 'inception/mixed5a_1x1_pre_relu/conv', 'inception/mixed5a_1x1_pre_relu', 'inception/mixed5a_1x1', 'inception/mixed5a_3x3_bottleneck_pre_relu/conv', 'inception/mixed5a_3x3_bottleneck_pre_relu', 'inception/mixed5a_3x3_bottleneck', 'inception/mixed5a_3x3_pre_relu/conv', 'inception/mixed5a_3x3_pre_relu', 'inception/mixed5a_3x3', 'inception/mixed5a_5x5_bottleneck_pre_relu/conv', 'inception/mixed5a_5x5_bottleneck_pre_relu', 'inception/mixed5a_5x5_bottleneck', 'inception/mixed5a_5x5_pre_relu/conv', 'inception/mixed5a_5x5_pre_relu', 'inception/mixed5a_5x5', 'inception/mixed5a_pool', 'inception/mixed5a_pool_reduce_pre_relu/conv', 'inception/mixed5a_pool_reduce_pre_relu', 'inception/mixed5a_pool_reduce', 'inception/mixed5a/concat_dim', 'inception/mixed5a', 'inception/mixed5b_1x1_pre_relu/conv', 'inception/mixed5b_1x1_pre_relu', 'inception/mixed5b_1x1', 'inception/mixed5b_3x3_bottleneck_pre_relu/conv', 'inception/mixed5b_3x3_bottleneck_pre_relu', 
'inception/mixed5b_3x3_bottleneck', 'inception/mixed5b_3x3_pre_relu/conv', 'inception/mixed5b_3x3_pre_relu', 'inception/mixed5b_3x3', 'inception/mixed5b_5x5_bottleneck_pre_relu/conv', 'inception/mixed5b_5x5_bottleneck_pre_relu', 'inception/mixed5b_5x5_bottleneck', 'inception/mixed5b_5x5_pre_relu/conv', 'inception/mixed5b_5x5_pre_relu', 'inception/mixed5b_5x5', 'inception/mixed5b_pool', 'inception/mixed5b_pool_reduce_pre_relu/conv', 'inception/mixed5b_pool_reduce_pre_relu', 'inception/mixed5b_pool_reduce', 'inception/mixed5b/concat_dim', 'inception/mixed5b', 'inception/avgpool0', 'inception/head0_pool', 'inception/head0_bottleneck_pre_relu/conv', 'inception/head0_bottleneck_pre_relu', 'inception/head0_bottleneck', 'inception/head0_bottleneck/reshape/shape', 'inception/head0_bottleneck/reshape', 'inception/nn0_pre_relu/matmul', 'inception/nn0_pre_relu', 'inception/nn0', 'inception/nn0/reshape/shape', 'inception/nn0/reshape', 'inception/softmax0_pre_activation/matmul', 'inception/softmax0_pre_activation', 'inception/softmax0', 'inception/head1_pool', 'inception/head1_bottleneck_pre_relu/conv', 'inception/head1_bottleneck_pre_relu', 'inception/head1_bottleneck', 'inception/head1_bottleneck/reshape/shape', 'inception/head1_bottleneck/reshape', 'inception/nn1_pre_relu/matmul', 'inception/nn1_pre_relu', 'inception/nn1', 'inception/nn1/reshape/shape', 'inception/nn1/reshape', 'inception/softmax1_pre_activation/matmul', 'inception/softmax1_pre_activation', 'inception/softmax1', 'inception/avgpool0/reshape/shape', 'inception/avgpool0/reshape', 'inception/softmax2_pre_activation/matmul', 'inception/softmax2_pre_activation', 'inception/softmax2', 'inception/output', 'inception/output1', 'inception/output2']

The input to the graph is stored in the first tensor output, and the probability of the 1000 possible objects is in the last layer:

In [9]:
input_name = names[0] + ':0'
x = g.get_tensor_by_name(input_name)
In [10]:
softmax = g.get_tensor_by_name(names[-1] + ':0')

Predicting with the Inception Network

Let's try to use the network to predict now:

In [11]:
from skimage.data import coffee
og = coffee()
plt.imshow(og)
print(og.min(), og.max())
0 255

We'll crop and resize the image to 299 x 299 pixels. I've provided a simple helper function which will do this for us:

In [12]:
# Note that in the lecture, I used a slightly different inception
# model, and this one requires us to subtract the mean from the input image.
# The preprocess function will also crop/resize the image to 299x299
img = inception.preprocess(og)
print(og.shape), print(img.shape)  # the (None, None) in Out[12] is just the return values of the two print calls
(400, 600, 3)
(299, 299, 3)
Out[12]:
(None, None)
In [13]:
# So this will now be a different range than what we had in the lecture:
print(img.min(), img.max())
-117.0 138.0
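For reference, the preprocess helper roughly amounts to a central square crop, a resize to 299 x 299, and a mean subtraction. The exact details live in libs/inception.py; this is just a sketch under those assumptions, with the mean value of 117 inferred from the range printed above:

import numpy as np
from skimage.transform import resize

def rough_preprocess(img, mean=117.0):
    # Central square crop.
    h, w = img.shape[:2]
    s = min(h, w)
    crop = img[(h - s) // 2:(h + s) // 2, (w - s) // 2:(w + s) // 2]
    # Resize to the network's expected input size, keeping the 0-255 range,
    # then subtract the mean.
    crop = resize(crop, (299, 299), preserve_range=True)
    return crop.astype(np.float32) - mean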

As we've seen in the last session, our images must have a 4-dimensional shape describing the number of images, height, width, and number of channels. So our original 3-dimensional image of height, width, channels needs an additional dimension on the 0th axis.

In [14]:
img_4d = img[np.newaxis]
print(img_4d.shape)
(1, 299, 299, 3)
In [15]:
fig, axs = plt.subplots(1, 2)
axs[0].imshow(og)

# Note that unlike the lecture, we have to call the `inception.deprocess` function
# so that it adds back the mean!
axs[1].imshow(inception.deprocess(img))
Out[15]:
<matplotlib.image.AxesImage at 0x13ef9fcc0>
In [16]:
res = np.squeeze(softmax.eval(feed_dict={x: img_4d}))
In [17]:
# Note that this network is slightly different than the one used in the lecture.
# Instead of just 1 output, there will be 16 outputs of 1008 probabilities.
# We only use the first 1000 probabilities (the extra ones are for negative/unseen labels)
res.shape
Out[17]:
(16, 1008)

The result of the network is a vector with a probability for each class. Inside our net dictionary are the labels for every element. We can sort these and use the labels of the 1000 classes to see what the top 5 predicted probabilities and labels are:

In [18]:
# Note that this is one way to aggregate the different probabilities.  We could also
# take the argmax.
res = np.mean(res, 0)
res = res / np.sum(res)
In [19]:
print([(res[idx], net['labels'][idx])
       for idx in res.argsort()[-5:][::-1]])
[(0.99849206, (947, 'espresso')), (0.000631253, (859, 'cup')), (0.00050241494, (953, 'chocolate sauce')), (0.00019483207, (844, 'consomme')), (0.00013370356, (822, 'soup bowl'))]

Visualizing Filters

Wow, so it works! But how?! Well, that's an ongoing research question. There have been a lot of great developments in the last few years to help us understand what might be happening. Let's first try to visualize the weights of the convolution filters, like we've done with our MNIST network before.

In [20]:
W = g.get_tensor_by_name('inception/conv2d0_w:0')
W_eval = W.eval()
print(W_eval.shape)
(7, 7, 3, 64)

With MNIST, the number of input channels to our filters was 1, since MNIST is grayscale. But in this case, our input has 3 channels, and so each convolution filter also has 3 input channels. We can try to see every single individual filter using the library tool I've provided:

In [21]:
from libs import utils
W_montage = utils.montage_filters(W_eval)
plt.figure(figsize=(10,10))
plt.imshow(W_montage, interpolation='nearest')
Out[21]:
<matplotlib.image.AxesImage at 0x1379950b8>

Or, we can also try to look at them as RGB filters, showing the influence of each color channel, for each neuron or output filter.

In [22]:
Ws = [utils.montage_filters(W_eval[:, :, [i], :]) for i in range(3)]
Ws = np.rollaxis(np.array(Ws), 0, 3)
plt.figure(figsize=(10,10))
plt.imshow(Ws, interpolation='nearest')
Out[22]:
<matplotlib.image.AxesImage at 0x143c37550>

In order to better see what these are doing, let's normalize the filters' range:

In [23]:
np.min(Ws), np.max(Ws)
Ws = (Ws / np.max(np.abs(Ws)) * 128 + 128).astype(np.uint8)
plt.figure(figsize=(10,10))
plt.imshow(Ws, interpolation='nearest')
Out[23]:
<matplotlib.image.AxesImage at 0x14408d710>

Like with our MNIST example, we can probably guess what some of these are doing. They are responding to edges, corners, and center-surround contrasts between pairs of colors, like red/green and blue/yellow. Interestingly, this is also what the neuroscience of vision tells us about how humans perceive color: through the opponency of red/green and blue/yellow. To get a better sense, we can try to look at the output of the convolution:

In [24]:
feature = g.get_tensor_by_name('inception/conv2d0_pre_relu:0')

Let's look at the shape:

In [25]:
layer_shape = tf.shape(feature).eval(feed_dict={x:img_4d})
print(layer_shape)
[  1 150 150  64]
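As a quick sanity check on that shape: with 'SAME' padding, a strided convolution's spatial output size is the input size divided by the stride, rounded up. A one-line check (just arithmetic, not part of the lecture):

import math
print(math.ceil(299 / 2))   # 150, matching the layer shape above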

So our original image, which was 1 x 299 x 299 x 3 color channels, now has 64 new channels of information. The image's height and width are also roughly halved, because of the stride of 2 in the convolution. We've just seen what each of the convolution filters looks like. Let's try to see how they filter the image now by looking at the resulting convolution.

In [26]:
f = feature.eval(feed_dict={x: img_4d})
montage = utils.montage_filters(np.rollaxis(np.expand_dims(f[0], 3), 3, 2))
fig, axs = plt.subplots(1, 3, figsize=(20, 10))
axs[0].imshow(inception.deprocess(img))
axs[0].set_title('Original Image')
axs[1].imshow(Ws, interpolation='nearest')
axs[1].set_title('Convolution Filters')
axs[2].imshow(montage, cmap='gray')
axs[2].set_title('Convolution Outputs')
Out[26]:
<matplotlib.text.Text at 0x1406f3cc0>

It's a little hard to see what's happening here, but let's try. The third filter, for instance, seems to be a lot like the Gabor filter we created in the first session. It responds to horizontal edges, since it has a bright component at the top and a dark component on the bottom. Looking at the output of the convolution, we can see that the horizontal edges really pop out.

Visualizing the Gradient

So this is a pretty useful technique for the first convolution layer. But when we get to the next layer, all of a sudden we have 64 different channels of information being fed to many more convolution filters of very high dimensionality. It's very hard to conceptualize that many dimensions, let alone try to figure out what the layer could be doing with all the possible combinations it has with other neurons in other layers.

If we want to understand what the deeper layers are really doing, we're going to have to use backprop to show us the gradients of a particular neuron with respect to our input image. Let's visualize the network's gradient activation when backpropagated to the original input image. This effectively tells us which pixels contribute most to the predicted class or a given neuron's activation.

We use a forward pass up to the layer that we are interested in, and then a backprop to help us understand what pixels in particular contributed to the final activation of that layer. We will need to create an operation which will find the max neuron of all activations in a layer, and then calculate the gradient of that objective with respect to the input image.

In [27]:
feature = g.get_tensor_by_name('inception/conv2d0_pre_relu:0')
gradient = tf.gradients(tf.reduce_max(feature, 3), x)

When we run this network now, we will specify the gradient operation we've created, instead of the softmax layer of the network. This will run a forward pass up to the layer we asked for the gradient of, and then a backward pass all the way back to the input image.

In [28]:
res = sess.run(gradient, feed_dict={x: img_4d})[0]

Let's visualize the original image and the output of the backpropagated gradient:

In [29]:
fig, axs = plt.subplots(1, 2)
axs[0].imshow(inception.deprocess(img))
axs[1].imshow(res[0])
Out[29]:
<matplotlib.image.AxesImage at 0x146bc2940>
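One caveat: the raw gradient values can fall well outside the 0-1 range matplotlib expects, so the image may look clipped or washed out. A common trick, shown here as a sketch (not necessarily how the lecture proceeds), is to normalize the gradient before displaying it:

def normalize(img, s=0.1):
    # Scale by the standard deviation, recenter at 0.5, and clip to [0, 1]
    # so matplotlib can display the gradient sensibly.
    z = (img - img.mean()) / max(img.std(), 1e-4)
    return np.clip(z * s + 0.5, 0, 1)

plt.imshow(normalize(res[0]))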