Skip to Main Content?

Session 2: Training A Network W/ TensorFlow

Creative Applications of Deep Learning with TensorFlow
Learn More About This Course
Session Media
Media Queue
Session Content
Training A Network W/ TensorFlow
Training A Network W/ TensorFlow

We'll see how neural networks work, how they are "trained", and see the basic components of training a neural network. We'll then build our first n...

Session 2: Training a network w/ Tensorflow

Creative Applications of Deep Learning with Google's Tensorflow
Parag K. Mital
Kadenze, Inc.

Learning Goals

  • The basic components of a neural network
  • How to use gradient descent to optimize parameters of a neural network
  • How to create a neural network for performing regression

Table of Contents

  • Introduction
  • Gradient Descent
    • Defining Cost
    • Minimizing Error
    • Backpropagation
    • Extra details for notebook only
    • Local Minima/Optima
    • Learning Rate
  • Creating a Neural Network
    • Defining Cost
    • Training Parameters
    • Training vs. Testing
    • Stochastic and Mini Batch Gradient Descent
    • Input's Representation
    • Over vs. Underfitting
    • Introducing Nonlinearities / Activation Function
    • Going Deeper
  • Image Inpainting
    • Description
    • Building the Network
    • Training
  • Homework:
  • Reading:

Introduction

In this session we're going to take everything we've learned about Graphs, Sessions, Operations, and Tensors and use them all to form a neural network. We're going to learn how we can use data and something called gradient descent to teach the network what the values of the parameters of this network should be.

In the last session, we saw how to normalize a dataset, using the dataset's mean and standard deviation. While this seemed to reveal some interesting representations of our dataset, it left us with a lot more to explain. In the case of faces, it really seemed to explain more about the background than the actual faces. For instance, it wasn't able to describe the differences between different races, gender, expressions, hair style, hair color, or the other many various differences that one might be interested in.

What we're really interested in is letting the computer figure out what representations it needs in order to better describe the data and some objective that we've defined. That is the fundamental idea behind machine learning: letting the machine learn from the data. In this session, we're going to start to see how to do that.

Before we get into the details, I'm going to go over some background on gradient descent and the different components of a neural network. If you're comfortable with all of this, please feel free to skip ahead.

Gradient Descent

Whenever we create a neural network, we have to define a set of operations. These operations try to take us from some input to some output. For instance, the input might be an image, or frame of a video, or text file, or sound file. The operations of the network are meant to transform this input data into something meaningful that we want the network to learn about.

Initially, all of the parameters of the network are random. So whatever is being output will also be random. But let's say we need it to output something specific about the image. To teach it to do that, we're going to use something called "Gradient Descent". Simply, Gradient descent is a way of optimizing a set of parameters.

Let's say we have a few images, and know that given a certain image, when I feed it through a network, its parameters should help the final output of the network be able to spit out the word "orange", or "apple", or some appropriate label given the image of that object. The parameters should somehow accentuate the "orangeness" of my image. It probably will be able to transform an image in away that it ends up having high intensities for images that have the color orange in them, and probably prefer images that have that color in a fairly round arrangement.

Rather than hand crafting all of the possible ways an orange might be manifested, we're going to learn the best way to optimize its objective: separating oranges and apples. How can we teach a network to learn something like this?

Defining Cost

Well we need to define what "best" means. In order to do so, we need a measure of the "error". Let's continue with the two options we've been using: orange, or apple. I can represent these as 0 and 1 instead.

I'm going to get a few images of oranges, and apples, and one by one, feed them into a network that I've randomly initialized. I'll then filter the image, by just multiplying every value by some random set of values. And then I'll just add up all the numbers, and then squash the result in a way that means I'll only ever get 0 or 1. So I put in an image, and I get out a 0 or 1. Except, the parameters of my network are totally random, and so my network will only ever spit out random 0s or 1s. How can I get this random network to know when to spit out a 0 for images of oranges, and a 1 for images of apples?

We do that by saying, if the network predicts a 0 for an orange, then the error is 0. If the network predicts a 1 for an orange, then the error is 1. And vice-versa for apples. If it spits out a 1 for an apple, then the error is 0. If it spits out a 0 for an apple, then the error is 1. What we've just done is create a function which describes error in terms of our parameters:

Let's write this another way:

$$\begin{align}
\text{error} = \text{network}(\text{image}) - \text{true_label}
\end{align}$$

where

$$\begin{align}
\text{network}(\text{image}) = \text{predicted_label}
\end{align}$$

More commonly, we'll see these components represented by the following letters:

$$\begin{align}
E = f(X) - y
\end{align}$$

Don't worry about trying to remember this equation. Just see how it is similar to what we've done with the oranges and apples. X is generally the input to the network, which is fed to some network, or a function $f$, which we know should output some label y. Whatever difference there is between what it should output, y, and what it actually outputs, $f(x)$ is what is different, or error, $E$.

Minimizing Error

Instead of feeding one image at a time, we're going to feed in many. Let's say 100. This way, we can see what our network is doing on average. If our error at the current network parameters is e.g. 50/100, we're correctly guessing about 50 of the 100 images.

Now for the crucial part. If we move our network's parameters a tiny bit and see what happens to our error, we can actually use that knowledge to find smaller errors. Let's say the error went up after we moved our network parameters. Well then we know we should go back the way we came, and try going the other direction entirely. If our error went down, then we should just keep changing our parameters in the same direction. The error provides a "training signal" or a measure of the "loss" of our network. You'll often hear anyone number of these terms to describe the same thing, "Error", "Cost", "Loss", or "Training Signal'. That's pretty much gradient descent in a nutshell. Of course we've made a lot of assumptions in assuming our function is continuous and differentiable. But we're not going to worry about that, and if you don't know what that means, don't worry about it.

Backpropagation

To summarize, Gradient descent is a simple but very powerful method for finding smaller measures of error by following the negative direction of its gradient. The gradient is just saying, how does the error change at the current set of parameters?

One thing I didn't mention was how we figure out what the gradient is. In order to do that, we use something called backpropagation. When we pass as input something to a network, it's doing what's called forward propagation. We're sending an input and multiplying it by every weight to an expected output. Whatever differences that output has with the output we wanted it to have, gets backpropagated to every single parameter in our network. Basically, backprop is a very effective way to find the gradient by simply multiplying many partial derivatives together. It uses something called the chain rule to find the gradient of the error with respect to every single parameter in a network, and follows this error from the output of the network, all the way back to the input.

While the details won't be necessary for this course, we will come back to it in later sessions as we learn more about how we can use both backprop and forward prop to help us understand the inner workings of deep neural networks.

If you are interested in knowing more details about backprop, I highly recommend both Michael Nielsen's online Deep Learning book:

http://neuralnetworksanddeeplearning.com/

and Yoshua Bengio's online book:

http://www.deeplearningbook.org/

Extra details for notebook only

To think about this another way, the definition of a linear function is written like so:

$$\begin{align}
y = mx + b
\end{align}$$

The slope, or gradient of this function is $m$ everywhere. It's describing how the function changes with different network parameters. If I follow the negative value of $m$, then I'm going down the slope, towards smaller values.

But not all functions are linear. Let's say the error was something like a parabola:

$$\begin{align}
y(x) = x^2
\end{align}$$

That just says, there is a function y, which takes one parameter, $x$, and this function just takes the value of $x$ and multiplies it by itself, or put another way, it outputs $x^2$. Let's start at the minimum. At $x = 0$, our function $y(0) = 0$. Let's try and move a random amount, and say we end up at $1$. So at $x = 1$, we know that our function went up from $y(0) = 0$ to $y(1) = 1$. The change in $y = 1$. The change in $x = 1$. So our slope is:

$$\begin{align}
\frac{\text{change in } y}{\text{change in } x} = \frac{(y(1) - y(0)}{(1 - 0)} = \frac{1}{1} = 1
\end{align}$$

If we go in the negative direction of this, $x = x - 1$, we get back to 0, our minimum value.

If you try this process for any value and you'll see that if you keep going towards the negative slope, you go towards smaller values.

You might also see this process described like so:

$$\begin{align}
\theta = \theta - \eta \cdot \nabla_\theta J( \theta)
\end{align}$$

That's just saying the same thing really. We're going to update our parameters, commonly referred to by $\theta$, by finding the gradient, $\nabla$ with respect to parameters $\theta$, $\nabla_\theta$, of our error, $J$, and moving down the negative direction of it: $- \eta \cdot \nabla_\theta J( \theta)$. The $\eta$ is just a parameter also known as the learning rate, and it describes how far along this gradient we should travel, and we'll typically set this value from anywhere between 0.01 to 0.00001.

Local Minima/Optima

Before we start, we're going to need some library imports:

In [1]:
# imports
%matplotlib inline
# %pylab osx
import os
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
plt.style.use('ggplot')

One pitfall of gradient descent is that some functions contain "minima", which is another way of saying a trough, or a concave point, or put another way, a dip in a function.

Let's say, purely for illustration, that our cost function looked like:

In [2]:
fig = plt.figure(figsize=(10, 6))
ax = fig.gca()
x = np.linspace(-1, 1, 200)
hz = 10
cost = np.sin(hz*x)*np.exp(-x)
ax.plot(x, cost)
ax.set_ylabel('Cost')
ax.set_xlabel('Some Parameter')
Out[2]:
<matplotlib.text.Text at 0x110a5aa58>

We'll never really ever be able to see our entire cost function like this. If we were able to, we'd know exactly what parameter we should use. So we're just imagining that as any parameters in our network change, this is how cost would change. Since we know the value of the cost everywhere, we can easily describe the gradient using np.diff, which will just measure the difference between every value. That's a good approximation of the gradient for this illustration at least.

In [3]:
gradient = np.diff(cost)

If we follow the negative gradient of this function given some randomly intialized parameter and a learning rate:

In [4]:
fig = plt.figure(figsize=(10, 6))
ax = fig.gca()
x = np.linspace(-1, 1, 200)
hz = 10
cost = np.sin(hz*x)*np.exp(-x)
ax.plot(x, cost)
ax.set_ylabel('Cost')
ax.set_xlabel('Some Parameter')
n_iterations = 500
cmap = plt.get_cmap('coolwarm')
c_norm = colors.Normalize(vmin=0, vmax=n_iterations)
scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=cmap)
init_p = 120#np.random.randint(len(x)*0.2, len(x)*0.8)
learning_rate = 1.0
for iter_i in range(n_iterations):
    init_p -= learning_rate * gradient[init_p]
    ax.plot(x[init_p], cost[init_p], 'ro', alpha=(iter_i + 1) / n_iterations, color=scalar_map.to_rgba(iter_i))
/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/ipykernel/__main__.py:17: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/ipykernel/__main__.py:16: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

What this would mean is depending on where our random initialization of weights began, our final cost might end up being somewhere around -0.5. This is a local minima. It is, based on its surroundings, a minima. But it is not the global minima. In fact there are a few other possible places the network could have ended up, if our initialization led us to another point first, meaning our final cost would have been different.

This illustration is just for a single parameter... but our networks will often have millions of parameters... I'll illustrate the same idea with just two parameters to give you a sense of how quickly the problem becomes very difficult.

In [5]:
from mpl_toolkits.mplot3d import axes3d
from matplotlib import cm
fig = plt.figure(figsize=(10, 6))
ax = fig.gca(projection='3d')
x, y = np.mgrid[-1:1:0.02, -1:1:0.02]
X, Y, Z = x, y, np.sin(hz*x)*np.exp(-x)*np.cos(hz*y)*np.exp(-y)
ax.plot_surface(X, Y, Z, rstride=2, cstride=2, alpha=0.75, cmap='jet', shade=False)
ax.set_xlabel('Some Parameter 1')
ax.set_ylabel('Some Parameter 2')
ax.set_zlabel('Cost')
# ax.axis('off')
Out[5]:
<matplotlib.text.Text at 0x1114ee400>

It turns out that in practice, as the number of your parameters grows, say to a million, then finding a local minima will more often than not turn out to be very good minima. That's good news for deep networks as we'll often work with that many parameters.

Learning Rate

Another aspect of learning what our parameters should be, is how far along the gradient we should move our parameters? That is also known as learning_rate. Let's see what happens for different values of our learning rate:

In [6]:
fig, axs = plt.subplots(1, 3, figsize=(20, 6))
for rate_i, learning_rate in enumerate([0.01, 1.0, 500.0]):
    ax = axs[rate_i]
    x = np.linspace(-1, 1, 200)
    hz = 10
    cost = np.sin(hz*x)*np.exp(-x)
    ax.plot(x, cost)
    ax.set_ylabel('Cost')
    ax.set_xlabel('Some Parameter')
    ax.set_title(str(learning_rate))
    n_iterations = 500
    cmap = plt.get_cmap('coolwarm')
    c_norm = colors.Normalize(vmin=0, vmax=n_iterations)
    scalar_map = cmx.ScalarMappable(norm=c_norm, cmap=cmap)
    init_p = 120#np.random.randint(len(x)*0.2, len(x)*0.8)
    for iter_i in range(n_iterations):
        init_p -= learning_rate * gradient[init_p]
        ax.plot(x[init_p], cost[init_p], 'ro', alpha=(iter_i + 1) / n_iterations, color=scalar_map.to_rgba(iter_i))
/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/ipykernel/__main__.py:18: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/ipykernel/__main__.py:17: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future

In the first case, our learning rate was way too small. It looks like we didn't manage to get any better cost than where we started! In the second case, just right. In the third case, our learning rate was too large. Meaning, we overshot our minima, and moved past it. So our cost has the effect of going up and down, instead of just going down like in the second case.

We'll learn more tricks for changing this landscape to be a bit more concave, reducing the number of local minima by regularizing the landscape through many different extensions to this same basic idea of following the negative slope of our gradient. Before we can get into them we'll need to learn how to create a neural network.

Creating a Neural Network

Let's try a simple example network. We're going try to find a mapping of an input X to an output y, just like in our example of mapping an input image to either a 0 or 1.

In [7]:
# Let's create some toy data

# We are going to say that we have seen 1000 values of some underlying representation that we aim to discover
n_observations = 1000

# Instead of having an image as our input, we're going to have values from -3 to 3.  This is going to be the input to our network.
xs = np.linspace(-3, 3, n_observations)

# From this input, we're going to teach our network to represent a function that looks like a sine wave.  To make it difficult, we are going to create a noisy representation of a sine wave by adding uniform noise.  So our true representation is a sine wave, but we are going to make it difficult by adding some noise to the function, and try to have our algorithm discover the underlying cause of the data, which is the sine wave without any noise.
ys = np.sin(xs) + np.random.uniform(-0.5, 0.5, n_observations)
plt.scatter(xs, ys, alpha=0.15, marker='+')
Out[7]:
<matplotlib.collections.PathCollection at 0x111499978>

So now we can see that there is a sine wave looking thing but it's really noisy. We want to train a network to say, given any value on the $x$ axis, tell me what the value should be on the $y$ axis. That is the fundamental idea of regression. Predicting some continuous output value given some continuous input value.

Defining Cost

We're going to use tensorflow to train our first network:

In [8]:
# variables which we need to fill in when we are ready to compute the graph.
# We'll pass in the values of the x-axis to a placeholder called X.
X = tf.placeholder(tf.float32, name='X')

# And we'll also specify what the y values should be using another placeholder, y.
Y = tf.placeholder(tf.float32, name='Y')

Now for parameters of our network. We're going to transform our x values, just like we did with an image and filtering it. In order to do that, we're going to multiply the value of x by some unknown value. Pretty simple. So what that lets us do is scale the value coming in. We'll also allow for a simple shift by adding another number. That lets us move the range of values to any new position.

But we need an initial value for our parameters. For that, we're going to use values close to 0 using a gaussian function:

In [9]:
sess = tf.InteractiveSession()
n = tf.random_normal([1000]).eval()
plt.hist(n)
Out[9]:
(array([  25.,   72.,  144.,  208.,  218.,  175.,  109.,   36.,   10.,    3.]),
 array([-2.53853273, -1.93898957, -1.3394464 , -0.73990324, -0.14036007,
         0.4591831 ,  1.05872626,  1.65826943,  2.2578126 ,  2.85735576,
         3.45689893]),
 <a list of 10 Patch objects>)

In order to do that, we can use the tensorflow random_normal function. If we ask for 1000 values and then plot a histogram of the values, we can see that the values are centered around 0 and are mostly between -3 and 3. For neural networks, we will usually want the values to start off much closer to 0. To do that, we can control the standard deviation like so:

In [10]:
n = tf.random_normal([1000], stddev=0.1).eval()
plt.hist(n)
Out[10]:
(array([   5.,   20.,   81.,  223.,  261.,  254.,  107.,   40.,    8.,    1.]),
 array([-0.34643096, -0.27188794, -0.19734492, -0.1228019 , -0.04825888,
         0.02628414,  0.10082716,  0.17537018,  0.2499132 ,  0.32445622,
         0.39899924]),
 <a list of 10 Patch objects>)
In [11]:
# To create the variables, we'll use tf.Variable, which unlike a placeholder, does not require us to define the value at the start of a run/eval.  It does need an initial value, which we'll give right now using the function tf.random_normal.  We could also pass an initializer, which is simply a function which will call the same function.  We'll see how that works a bit later.  In any case, the random_normal function just says, give me a random value from the "normal" curve.  We pass that value to a tf.Variable which creates a tensor object.
W = tf.Variable(tf.random_normal([1], dtype=tf.float32, stddev=0.1), name='weight')

# For bias variables, we usually start with a constant value of 0.
B = tf.Variable(tf.constant([0], dtype=tf.float32), name='bias')

# Now we can scale our input placeholder by W, and add our bias, b.
Y_pred = X * W + B

We're going to use gradient descent to learn what the best value of W and b is. In order to do that, we need to know how to measure what the best is. Let's think about that for a moment. What is it we're trying to do? We're trying to transform a value coming into the network, x, which ranges from values of -3 to 3, to match a known value, Y, which should be a sine wave which ranges from -1 to 1. So any value into the network should make it seem like the network represents a sine wave. Well we know what a sine wave should be. We can just use python to calculate it for us. We just need a function that measures distance:

In [12]:
# this function will measure the absolute distance, also known as the l1-norm
def distance(p1, p2):
    return tf.abs(p1 - p2)
In [13]:
# and now we can take the output of our network and our known target value
# and ask for the distance between them
cost = distance(Y_pred, tf.sin(X))

This function is just saying, give me the distance from the predicted value to the assumed underlying sine wave value. But let's say this was some natural occuring data in the world. Or a more complex function, like an image of oranges or apples. We don't know what the function is that determines whether we perceive an image as an apple or orange.

In [14]:
# cost = distance(Y_pred, ?)

But we do have a limited set of data that says what a given input should output. That means we can still learn what the function might be based on the data.

So instead of our previous cost function, we'd have:

In [15]:
cost = distance(Y_pred, Y)

where Y is the true Y value.

Now it doesn't matter what the function is. Our cost will measure the difference to the value we have for the input, and try to find the underlying function. Lastly, we need to sum over every possible observation our network is fed as input. That's because we don't give our network 1 x value at a time, but generally will give it 50-100 or more examples at a time.

In [16]:
cost = tf.reduce_mean(distance(Y_pred, Y))

Training Parameters

Let's see how we can learn the parameters of this simple network using a tensorflow optimizer.

In [17]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

<TODO: Describe Train/Test sets, use test set to visualize rest of number line>

We tell the optimizer to minimize our cost variable which measures the distance between the prediction and actual Y value. The optimizer knows how to calculate the gradient and follow it in the negative direction to find the smallest value, and handles updating all variables!

We now just need to iteratively run the optimizer, just like we would run/eval any other part of our tensorflow graph.

In [18]:
# We create a session to use the graph
n_iterations = 500

# Plot the true data distribution
fig, ax = plt.subplots(1, 1)
ax.scatter(xs, ys, alpha=0.15, marker='+')
with tf.Session() as sess:
    # Here we tell tensorflow that we want to initialize all
    # the variables in the graph so we can use them
    # This will set `W` and `b` to their initial random normal value.
    sess.run(tf.global_variables_initializer())

    # We now run a loop over epochs
    prev_training_cost = 0.0
    for it_i in range(n_iterations):
        sess.run(optimizer, feed_dict={X: xs, Y: ys})
        training_cost = sess.run(cost, feed_dict={X: xs, Y: ys})

        # every 10 iterations
        if it_i % 10 == 0:
            # let's plot the x versus the predicted y
            ys_pred = Y_pred.eval(feed_dict={X: xs}, session=sess)

            # We'll draw points as a scatter plot just like before
            # Except we'll also scale the alpha value so that it gets
            # darker as the iterations get closer to the end
            ax.plot(xs, ys_pred, 'k', alpha=it_i / n_iterations)
            fig.show()
            plt.draw()

            # And let's print our training cost: mean of absolute differences
            print(training_cost)

        # Allow the training to quit if we've reached a minimum
        if np.abs(prev_training_cost - training_cost) < 0.000001:
            break

        # Keep track of the training cost
        prev_training_cost = training_cost
Exception ignored in: <bound method InteractiveSession.__del__ of <tensorflow.python.client.session.InteractiveSession object at 0x1065377b8>>
Traceback (most recent call last):
  File "/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 171, in __del__
    self.close()
  File "/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 976, in close
    self._default_session.__exit__(None, None, None)
  File "/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 3378, in get_controller
    % type(default))
AssertionError: Nesting violated for default stack of <class 'weakref'> objects
/Users/pkmital/.pyenv/versions/3.4.0/Python.framework/Versions/3.4/lib/python3.4/site-packages/matplotlib/figure.py:397: UserWarning: matplotlib is currently using a non-GUI backend, so cannot show the figure
  "matplotlib is currently using a non-GUI backend, "
1.01381
0.938715
0.866343
0.795964
0.730013
0.669383
0.616252
0.571913
0.534671
0.504404
0.481598
0.464451
0.45184
0.442368
0.434755
0.428502
0.423631
0.419577
0.416295
0.413903
0.41246
0.41139
0.410728
0.410171
0.409726
0.409379
0.409104
0.408834
0.408617
0.408457
0.40836
0.4083
0.408257
0.408229

After running the code, we should see our original noisy data. We call that the training data since it is training the network. And we see the output of the network as a solid black line.

Now you might be thinking, wait, that looks like nothing like a sine wave... I mean it has got the general trend of the line I guess. But it doesn't curve at all! We're going to get into why that is in a moment.

But first, we're going to have to learn a bit more about the different between training and testing networks.

Training vs. Testing

<TODO:>

Stochastic and Mini Batch Gradient Descent

Now remember when I said our cost manifold would have many local minima, and that we'd learn some tricks to help us find the best one? Well now we're ready to talk about two ways of helping with that. One is using what are called mini-batches. This is useful for a number of reasons. First, it is generally impractical to look at an entire dataset. You might have millions of images which you couldn't ever try loading all at once into a network. Instead you would look at some random subset of them. Second, we avoid trying to generalize our entire dataset, allowing us to navigate through more fine grained terrain. In order to use mini batches, we simply iterate through our entire dataset, batch_size at a time:

In [19]:
idxs = np.arange(100)
batch_size = 10
n_batches = len(idxs) // batch_size
for batch_i in range(n_batches):
    print(idxs[batch_i * batch_size : (batch_i + 1) * batch_size])
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 25 26 27 28 29]
[30 31 32 33 34 35 36 37 38 39]
[40 41 42 43 44 45 46 47 48 49]
[50 51 52 53 54 55 56 57 58 59]
[60 61 62 63 64 65 66 67 68 69]
[70 71 72 73 74 75 76 77 78 79]
[80 81 82 83 84 85 86 87 88 89]
[90 91 92 93 94 95 96 97 98 99]

It turns out that this is not the best idea, because we're always looking at the same order of our dataset. Neural networks love order. They will pick up on any order you give it and use that to its advantage. But the order of the data is entirely irrelevant to our problem. In some cases, it may turn out to be exactly what we want to do. For instance, if we want to learn about how something changes over time, like audio, or letters or words in a sequence which form sentences. Then we will have to make sure we're sending data in a certain order. But for now, we really want to avoid using any order.

So we'll have to randomly permute the indexes of our dataset like so:

In [20]:
rand_idxs = np.random.permutation(idxs)
batch_size = 10
n_batches = len(rand_idxs) // batch_size
print('# of batches:', n_batches)
for batch_i in range(n_batches):
    print(rand_idxs[batch_i * batch_size : (batch_i + 1) * batch_size])
# of batches: 10
[47 46 50  8 76 63 87 66 72 11]
[95 79 38 16 78 71 60 61 20 59]
[ 2 48  7 67 45 27 53 73 77 57]
[98 44 37 51 33 43 35 12 58 96]
[85 69 82 30 31 49 88 32 83  6]
[75 24 93 55 15 13 91 39 17 62]
[ 4 81 86 64 40 41 90 10 70 42]
[68 14 21 25  3 92 28 94 65 99]
[97 74  5 52 18 54 80 89 29 84]
[22  0  9  1 23 19 26 34 56 36]

What we've done above is look at a range of 100 possible indexes by chunking them into batch_size at a time. But we've also randomized the order we've looked at them so that we aren't prioritizing learning one part of a dataset over another.

We can implement this into our previous code like so:

In [21]:
batch_size = 1000
fig, ax = plt.subplots(1, 1)
ax.scatter(xs, ys, alpha=0.15, marker='+')
ax.set_xlim([-4, 4])
ax.set_ylim([-2, 2])
with tf.Session() as sess:
    # Here we tell tensorflow that we want to initialize all
    # the variables in the graph so we can use them
    # If we had used tf.random_normal_initializer or tf.constant_intitializer,
    # then this would have set `W` and `b` to their initial values.
    sess.run(tf.global_variables_initializer())

    # We now run a loop over epochs
    prev_training_cost = 0.0
    for it_i in range(n_iterations):
        idxs = np.random.permutation(range(len(xs)))
        n_batches = len(idxs) // batch_size
        for batch_i in range(n_batches):
            idxs_i = idxs[batch_i * batch_size: (batch_i + 1) * batch_size]
            sess.run(optimizer, feed_dict={X: xs[idxs_i], Y: ys[idxs_i]})

        training_cost = sess.run(cost, feed_dict={X: xs, Y: ys})

        if it_i % 10 == 0:
            ys_pred = Y_pred.eval(feed_dict={X: xs}, session=sess)
            ax.plot(xs, ys_pred, 'k', alpha=it_i / n_iterations)
            print(training_cost)
fig.show()
plt.draw()
1.05779
0.986123
0.915648
0.846516
0.778799
0.71353
0.653804
0.602119
0.559227
0.523675
0.495648
0.474023
0.458699
0.447444
0.438839
0.431773
0.42621
0.421696
0.418087
0.415061
0.413231
0.411967
0.41108
0.410476
0.409969
0.409551
0.409259
0.408984
0.408733
0.408536
0.408409
0.408328
0.408278
0.408242
0.408221
0.408211
0.4082
0.40819
0.408181
0.408177
0.408175
0.408173
0.408171
0.408169
0.408169
0.408169
0.408169
0.408169
0.408169
0.408169

The resulting process is also know as Mini-Batch Gradient Descent, since we are taking smaller batches of our data and performing gradient descent. Further, it is Stochastic, meaning the order of the data presented is randomized, and is also commonly referred to as Stochastic Gradient Descent. When the two ideas are combined, we have the best of both worlds: the mini batch part which allows us to get more stable updates; and the stochastic part which allows us to move to different parts of our cost's manifold entirely. I'll just use Gradient Descent as we'll always want it to be in batches, and to be the stochastic kind.

Let's stick all of the code necessary for training into a function so we don't have to type it out again:

In [22]:
def train(X, Y, Y_pred, n_iterations=100, batch_size=200, learning_rate=0.02):
    cost = tf.reduce_mean(distance(Y_pred, Y))
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    fig, ax = plt.subplots(1, 1)
    ax.scatter(xs, ys, alpha=0.15, marker='+')
    ax.set_xlim([-4, 4])
    ax.set_ylim([-2, 2])
    with tf.Session() as sess:
        # Here we tell tensorflow that we want to initialize all
        # the variables in the graph so we can use them
        # This will set W and b to their initial random normal value.
        sess.run(tf.global_variables_initializer())

        # We now run a loop over epochs
        prev_training_cost = 0.0
        for it_i in range(n_iterations):
            idxs = np.random.permutation(range(len(xs)))
            n_batches = len(idxs) // batch_size
            for batch_i in range(n_batches):
                idxs_i = idxs[batch_i * batch_size: (batch_i + 1) * batch_size]
                sess.run(optimizer, feed_dict={X: xs[idxs_i], Y: ys[idxs_i]})

            training_cost = sess.run(cost, feed_dict={X: xs, Y: ys})

            if it_i % 10 == 0:
                ys_pred = Y_pred.eval(feed_dict={X: xs}, session=sess)
                ax.plot(xs, ys_pred, 'k', alpha=it_i / n_iterations)
                print(training_cost)
    fig.show()
    plt.draw()

To get closer to a sine wave, we're going to have to be able to do more than simply scale our input with a multiplication! What if we had a lot more parameters? What if we have 10 different multiplications of the input? What does your intuition tell you? How are 10 more multiplications combined? 1000? A million?

In [23]:
# We're going to multiply our input by 100 values, creating an "inner layer"
# of 100 neurons.
n_neurons = 100
W = tf.Variable(tf.random_normal([1, n_neurons], stddev=0.1))

# and allow for n_neurons additions on each of those neurons
b = tf.Variable(tf.constant(0, dtype=tf.float32, shape=[n_neurons]))

# Instead of multiplying directly, we use tf.matmul to perform a
# matrix multiplication
h = tf.matmul(tf.expand_dims(X, 1), W) + b

# Create the operation to add every neuron's output
Y_pred = tf.reduce_sum(h, 1)

# Retrain with our new Y_pred
train(X, Y, Y_pred)
2.00709
1.1381
2.04806
2.68846
2.75491
2.20173
1.78201
1.44842
2.29706
1.82505

First, the training takes a lot longer! That's because our network is much larger. By adding 100 neurons, we've added 100 more multiplications, and 100 more additions for every observation. Since we have 10000 observations, that's $(100 + 100) * 10000$ more computations just for computing the output, or 100 million more computations. But that's not all. We also need to compute the gradients of every parameter! Having all of these extra parameters makes it much harder to find the best solution. So as our network expands, the amount of memory and computation will grow very fast, and training becomes more difficult.

Despite increasing everything about our network, looking at the cost, we're not doing much better! Why is that? Well, we've added a lot more multiplications. But it hasn't changed the fact that our function is still just a linear function. Multiplying a millions times wouldn't help but instead just make it harder to find the same solution we found with far less parameters. But also, the cost is going up and down, instead of just down. That's a good sign that we should probably reduce the learning rate.

Input's Representation

In order to get more complexity, we could consider changing our input's representation. For instance, if you are working with sound, it may not be the best idea to think about representing the sound as a signal, and instead you might want to explore using something like the discrete fourier transform. Or if you're working with text, there may be other representations that will allow you to learn more useful features of your data such as word histograms. There may be other possibilities depending on your application.

Over vs. Underfitting

One technique for representing curved data like a sine wave is to consider the different polynomials of your input.

In [24]:
# Instead of a single factor and a bias, we'll create a polynomial function
# of different degrees.  We will then learn the influence that each
# degree of the input (X^0, X^1, X^2, ...) has on the final output (Y).
Y_pred = tf.Variable(tf.random_normal([1]), name='bias')
for pow_i in range(0, 2):
    W = tf.Variable(
        tf.random_normal([1], stddev=0.1), name='weight_%d' % pow_i)
    Y_pred = tf.add(tf.multiply(tf.pow(X, pow_i), W), Y_pred)

# And then we'll retrain with our new Y_pred
train(X, Y, Y_pred)
1.61565
0.438559
0.408211
0.40817
0.408169
0.408174
0.408173
0.408171
0.408172
0.40817

If we use the 0th and 1st expansion, that is $x^0$, which just equals 1, and $x^1$. So $1 * W_1 + x * W_2$. That's exactly the same as what we've just done. It starts to get interesting once we add more powers:

In [25]:
# Instead of a single factor and a bias, we'll create a polynomial function
# of different degrees.  We will then learn the influence that each
# degree of the input (X^0, X^1, X^2, ...) has on the final output (Y).
Y_pred = tf.Variable(tf.random_normal([1]), name='bias')
for pow_i in range(0, 4):
    W = tf.Variable(
        tf.random_normal([1], stddev=0.1), name='weight_%d' % pow_i)
    Y_pred = tf.add(tf.multiply(tf.pow(X, pow_i), W), Y_pred)

# And then we'll retrain with our new Y_pred
train(X, Y, Y_pred)
0.487767
0.333574
0.412557
0.455543
0.4811
0.50941
0.518239
0.712861
0.622636
0.510807

But we really don't want to add too many powers. If we add just 1 more power:

In [26]:
# Instead of a single factor and a bias, we'll create a polynomial function
# of different degrees.  We will then learn the influence that each
# degree of the input (X^0, X^1, X^2, ...) has on the final output (Y).
Y_pred = tf.Variable(tf.random_normal([1]), name='bias')
for pow_i in range(0, 5):
    W = tf.Variable(
        tf.random_normal([1], stddev=0.1), name='weight_%d' % pow_i)
    Y_pred = tf.add(tf.multiply(tf.pow(X, pow_i), W), Y_pred)

# And then we'll retrain with our new Y_pred
train(X, Y, Y_pred)
0.893144
1.30899
1.90326
2.21289
3.37598
2.84538
2.36137
3.21127
2.67461
2.17921

The whole thing is completely off. In general, a polynomial expansion is hardly ever useful as it requires us to know what the underlying function is, meaning, what order polynomial is it?

Introducing Nonlinearities / Activation Function

How else can we get our line to express the curves in our data? What we'll explore instead is what happens when you add a non-linearity, which you might also hear be called an activation function. That is a really essential ingredient to any deep network. Practically every complex deep learning algorithm performs a series of linear, followed by nonlinear operations. By stacking sets of these, the complexity and power of expression grows far greater than any linear equation could.

We'll typically make use of one of three non-linearities for the rest of this course:

In [27]:
sess = tf.InteractiveSession()
x = np.linspace(-6,6,1000)
plt.plot(x, tf.nn.tanh(x).eval(), label='tanh')
plt.plot(x, tf.nn.sigmoid(x).eval(), label='sigmoid')
plt.plot(x, tf.nn.relu(x).eval(), label='relu')
plt.legend(loc='lower right')
plt.xlim([-6, 6])
plt.ylim([-2, 2])
plt.xlabel('Input')
plt.ylabel('Output')
plt.grid('on')

What each of these curves demonstrates is how instead of just multiplying an input by a number, creating another line, we non-linearly multiply the input value. That just means we will multiply our input by a different value depending on what the input value is. This allows us to express very complex ideas. If we do this enough times, we can express anything. Let's see how we can do this

In [28]:
# We're going to multiply our input by 10 values, creating an "inner layer"
# of n_neurons neurons.
n_neurons = 10
W = tf.Variable(tf.random_normal([1, n_neurons]), name='W')

# and allow for n_neurons additions on each of those neurons
b = tf.Variable(tf.constant(0, dtype=tf.float32, shape=[n_neurons]), name='b')

# Instead of just multiplying, we'll put our n_neuron multiplications through a non-linearity, the tanh function.
h = tf.nn.tanh(tf.matmul(tf.expand_dims(X, 1), W) + b, name='h')

Y_pred = tf.reduce_sum(h, 1)

# And retrain w/ our new Y_pred
train(X, Y, Y_pred)
2.25565
0.276184
0.265378
0.260378
0.257955
0.256217
0.255263
0.254435
0.253936
0.253369

<TODO: Graphic of fully connected network, matrix>

It turns out that multiplying our input by a matrix, adding a bias, and then applying a non-linearity is something we'll need to do a lot. It's often called a fully-connected network, since everything is connected to everything else, meaning every neuron is multiplied by every single input value. This is also sometimes called a linear layer, since we are linearly combining the values of the input to create the resulting neuron.

You might have seen this depicted like so:

Going Deeper

Let's write a simply function for creating the same type of network as above:

In [29]:
def linear(X, n_input, n_output, activation=None):
    W = tf.Variable(tf.random_normal([n_input, n_output], stddev=0.1), name='W')
    b = tf.Variable(
        tf.constant(0, dtype=tf.float32, shape=[n_output]), name='b')
    h = tf.nn.tanh(tf.matmul(X, W) + b, name='h')
    return h

Let's now take a look at what the tensorflow graph looks like when we create this type of connection:

In [30]:
# first clear the graph
from tensorflow.python.framework import ops
ops.reset_default_graph()

# let's get the current graph
g = tf.get_default_graph()

# See the names of any operations in the graph
[op.name for op in tf.get_default_graph().get_operations()]

# let's create a new network
X = tf.placeholder(tf.float32, name='X')
h = linear(X, 2, 10)

# See the names of any operations in the graph
[op.name for op in tf.get_default_graph().get_operations()]
Out[30]:
['X',
 'random_normal/shape',
 'random_normal/mean',
 'random_normal/stddev',
 'random_normal/RandomStandardNormal',
 'random_normal/mul',
 'random_normal',
 'W',
 'W/Assign',
 'W/read',
 'Const',
 'b',
 'b/Assign',
 'b/read',
 'MatMul',
 'add',
 'h']

The names of the variables in this network aren't very helpful. We can actually do a much better job here by creating our variables within scopes:

In [31]:
def linear(X, n_input, n_output, activation=None, scope=None):
    with tf.variable_scope(scope or "linear"):
        W = tf.get_variable(
            name='W',
            shape=[n_input, n_output],
            initializer=tf.random_normal_initializer(mean=0.0, stddev=0.1))
        b = tf.get_variable(
            name='b',
            shape=[n_output],
            initializer=tf.constant_initializer())
        h = tf.matmul(X, W) + b
        if activation is not None:
            h = activation(h)
        return h

We've also moved from using a single random value, to using an initializer. This initializer will create a new random value every time we call sess.run(tf.global_variables_initializer()). We also pass some more sensible values for the initial mean and standard deviation.

Now let's look at the graph:

In [32]:
# first clear the graph
from tensorflow.python.framework import ops
ops.reset_default_graph()

# let's get the current graph
g = tf.get_default_graph()

# See the names of any operations in the graph
[op.name for op in tf