diff --git a/README.md b/README.md
deleted file mode 100644
index 1f8c5c8..0000000
--- a/README.md
+++ /dev/null
@@ -1,27 +0,0 @@
-# IPython Theano Tutorials
-
-A collection of tutorials in ipynb format that illustrate how to do various
-things in Theano.
-
-# General
-
-* Theano Basics
-* Adding a custom Op to Theano
-* Numpy/Python function minimization using pyautodiff
-
-
-## Machine Learning:
-
-### Supervised Algorithms
-* Logistic Regression
-* Multilayer Perceptron (MLP)
-* Convolutional Network (Convnet)
-* Deep Belief Network (DBN)
-
-### Unsupervised Algorithms
-
-* Restricted Boltzmann Machine (RBM)
-* Autoassociator / Autoencoder (AA)
-* Stochasitc Denoising auto associator (SDAA)
-* Sparse coding
-
diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..8e7f6bd
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,29 @@
+README OF GH-PAGES BRANCH OF IPYTHON THEANO TUTORIALS
+=====================================================
+
+The pages published for the ipython notebooks work like this:
+
+* index.html is auto-built using the github pages wizard thing
+
+* All other html files are built from ipynb files in the 'master' branch
+
+
+## Rebuilding index.html
+
+Rebuild the index.html file by editing index.html directly.
+DO NOT regenerate it using the github wizard, it will erase everything else!!
+
+
+## Rebuilding all the notebook html files
+
+1. Install nbconvert (https://github.com/ipython/nbconvert)
+
+2. Set up this project branch in a subfolder (recommended: "html") of the
+IPythonTheanoTutorials master branch.
+
+3. Run ./rebuild_pages.sh
+
+4. Commit the changed html pages to [this] gh-pages branch to make them appear
+on the website.
+
+
diff --git a/images/arrow-down.png b/images/arrow-down.png
new file mode 100644
index 0000000..585b0bd
Binary files /dev/null and b/images/arrow-down.png differ
diff --git a/images/octocat-small.png b/images/octocat-small.png
new file mode 100644
index 0000000..66c2539
Binary files /dev/null and b/images/octocat-small.png differ
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..0b2c32d
--- /dev/null
+++ b/index.html
@@ -0,0 +1,125 @@
+
+
+
+ + +(test pages for)
+ +This project is maintained by jaberg
+ + +A collection of tutorials in ipynb format that illustrate how to do various +things in Theano.
+ ++
+
+
+
+
Download and unpack this project, and start up an ipython notebook to browse +through the tutorials.
+ +git clone https://github.com/jaberg/IPythonTheanoTutorials.git + cd IPythonTheanoTutorials + sh start_ipython_server.sh
+ +Before we dive into deep learning algorithms, lets make sure we understand the basics of the Python language and the IPython environment. Python is an interpreted language, and the iPython environment (powering this notebook) makes Python easy to learn, and fun to play with.
+If you're new to Python or the IPython notebook, get started by running the following code fragment. You can run it by clicking on the code and either:
+print "Hello Random World"
+seed(123)
+imshow(rand(10, 10))
+You can also edit the code and run it again the same way. The new output will replace the old output.
+If you want to play around with Python, you might take the following as a point of departure:
+a = 'foo' # -- strings are basic data types
+if 'foo' == "foo":
+ # single-quoted strings and double-quoted strings are the same
+ print """
+ In Python, "foo" and 'foo' are the same.
+ And this is a multi-line string literal.
+ """
+
+b = 3.4 # -- floating point literal
+a = 5 # -- integer literal; notice `a` can be re-used for different types
+
+c = (1, 2, 3) # -- tuples are immutable sequences
+d = [1, 2, 3] # -- lists are mutable sequences
+
+e = {'a': 6, 5: 8, c: 'foo'} # -- dictionary-creation syntax
+# -- define a function like this
+def fn(x):
+ return x + 2
+
+# -- sometimes you might see nested functions
+def f(x):
+ def g(y):
+ return x + y
+ return g
+
+add_2 = f(2)
+print 'add_2(5) =', add_2(5)
+
+# -- scoping - parameters mask variables
+x = 7 # -- global x
+def g(x):
+ # -- local scope - here `x` refers to the argument of g, not the global.
+ print 'local x:', x
+g(12)
+print 'global x:', x
+# -- positional vs. named arguments
+def multi(a, b, c=8):
+ # -- the `%` does string formatting
+ print 'multi a=%s b=%s c=%s' % (a, b, c)
+
+# -- all parameters can be assigned positionally or by name
+multi(1, 2, 3)
+multi(4, 5)
+multi(4, 5, c=12)
+multi(a=-1, b='foo', c='bar')
+
+# -- positional arguments can be pulled from a tuple or list
+args = ('do', 're', 'mi')
+multi(*args)
+
+# -- keyword arguments can be pulled from a dictionary
+kwargs = {'a': 0, 'b': -1, 'c': 1}
+multi(**kwargs)
+
+# -- all these can be combined!
+multi(8, *(88,), **{'c': 888})
+# -- argument unpacking implements multiple return values
+def rmulti(a):
+ return (a, a + 1)
+
+x, y = rmulti(4)
+print x
+print y
+There are many resources to help you get oriented to use Python for +numeric scientific work. A quick bit of googling turned up the following ones, +but feel free to look further.
+Unix/Python/NumPy - High-level intro to unix and Python.
+NumPy basics - Semi-official NumPy tutorial
+SciPy Getting Started - Semi-official SciPy tutorial
+The data sets we will use throughout these tutorials are provided via the +skdata package. The skdata package provides the logic of downloading, +unpacking, and providing a meaningful Python interface to various public data +sets.
+Running the following should show a picture of a boat, though it may have to first download the CIFAR-10 XXX data set in order to do it.
+Once the data set has been downloaded, it is stored in ~/.skdata/cifar10 for re-use. Browse through the data set by changing the index into x,
+or modify the code fragment to show more than one picture at a time. We'll also be using MNIST XXX and the Google Street View House Numbers (SVHN XXX) data sets.
from skdata.cifar10.views import OfficialImageClassificationTask
+task = OfficialImageClassificationTask()
+print 'Image shape', task.train.x[8].shape
+imshow(task.train.x[8], interpolation='nearest')
+from skdata.mnist.views import OfficialImageClassification
+task = OfficialImageClassification()
+print 'Digit shape', task.train.x[0].shape
+imshow(task.train.x[0][:,:,0], cmap=cm.gray, interpolation='nearest')
+from skdata.svhn.view import CroppedDigitsView2
+data_view = CroppedDigitsView2(x_dtype='float32', n_train=1000)
+imshow(data_view.train.x[0].reshape(32, 32, 3))
+How many examples are in each data set, how big is each one? How many floating point values are in each image?
+# Write your answer here
+How are the examples of each data set sorted? Stochastic gradient descent is not very efficient when it iterates through all the examples of one class, then all the examples of the next class, etc. Do any of our data sets need to be shuffled?
+# Write your answer here
+How are the pixel colors encoded in each data set? How are the labels encoded? Tip: the labels are in train.y
# Write your answer here
+The task objects have a test attribute as well as the train attribute which contains the test set for estimating generalization by cross-validation. How big are the test sets?
# Write your answer here
+Something a bit more advanced - write a code fragment to lot the mean and variance in pixel values over the data set. Start with MNIST, which is grey-scale. How would you do it for a color data set? How do the pictures change when you look at per-class statistics?
+# Write your answer here
+There are a lot of resources to help you do deep learning in Python. I've organized this page into sections: + * Deep Learning Software + * Scientific Software in Python + * Scientific Python Tutorials
+There are several software packages to be aware of, which do not necessarily have Python bindings but which may grow to include them. You don't always have to use Python!
+Torch7 - Deep Learning software based on C and Lua
+CudaConvNet - Alex Krizhevsky's software for deep convolutional nets.
+PyLearn2 - Deep Learning Framework in Python based on Theano.
+Deep learning can take your research in many directions, and it's nice that +Python has a lot of projects to help you on your way:
+Python - the standard library is surprisingly comprehensive, get to + know what's in it.
+NumPy - the defacto official n-dimensional array data type for Python. + Doing numerical work in Python, you will use these all the time.
+SciPy - a library of many algorithms and utilities + that are useful across a broad range of scientific work. I originally captioned the following "especially useful" but the list grew to + include almost the entire library:
+scipy.io - read and write various matrix formats including MATLAB
+scipy.linalg - decompositions, inverses, etc.
+scipy.ndimage - basic image processing
+scipy.optimize - 1-D and N-D optimization algorithms
+scipy.signal - signal processing
+scipy.sparse - several sparse matrix types
+scipy.special - special functions (e.g. erf)
+scipy.stats - pdfs, cdfs, sampling algorithms, tests
+IPython is a feature-rich Python interpreter.
+Matplotlib is the most widely-used plotting library for Python. It provides an + interface similar to that of MATLAB or R. It is sometimes annoying, and not as flashy as say, + d3 but it is the workhorse of data visualization in Python.
+Theano - an array expression compiler, with automatic differentiation and behind-the-scenes GPU execution.
+PyAutoDiff - provides automatic differentiation for code using NumPy. + Currently this is a Theano front-end, but the API has been designed to keep Theano out of the sight of client code.
+pandas - machine learning algorithms and types, mainly for working with time series. + Includes R-like data structures (like R's data frame).
+Cython - a compiler for a more strongly-typed Python dialect, very useful for optimizing numeric Python code.
+Copperhead - compiles natural python syntax to GPU kernels
+numexpr - compiles expression strings to perform loop-fused element-wise computations
+scikit-learn - well-documented and well-tested implementations of many standard ML algorithms.
+scikits-image - image-processing (edge detection, color spaces, many standard algorithms)
+skdata - data sets for machine learning
+There are several great resources to help you get oriented to use Python for +numeric scientific work.
+Unix/Python/NumPy - This + is a really high-level intro to unix and Python. There is tons of + documentation for these things and they're kind of out of scope of these tutorials--by all means dig around the net for more background if you want + to know more.
+Deep Learning Tutorials provide code-centered introductions to + Deep Learning algorithms and architectures. + They make heavy use of Theano, and illustrate good practices + of programming with Theano.
+The auto-encoder (AE) is a classic unsupervised algorithm for non-linear dimensionality reduction. +The de-noising auto-encoder (DAE) is an extension of the auto-encoder introduced as a building block for deep networks in [Vincent08]. +This tutorial introduces the autoencoder as an unsupervised feature learning algorithm for the MNIST data set, and then develops the denoising autoencoder. +See section 4.6 of [Bengio09]_ for an overview of auto-encoders.
+# Before we get started with the rest of the tutorial,
+# run this once (and again after every kernel restart)
+# to bring in all the external symbols we'll need.
+
+from functools import partial
+import logging
+import sys
+import time
+
+import numpy as np
+from numpy import dot, exp, log, newaxis, sqrt, tanh
+rng = np.random.RandomState(123)
+
+from skdata import mnist, cifar10, svhn
+import autodiff
+
+from util import show_filters
+print '-> Imports Complete'
+Let's also import the MNIST data set here, so that the examples in the tutorial have some data to work with. (HINT: Some of the exercises will involve using different data sets -- you can use different data sets with the tutorial code by writing a code block like this one that redefining x and x_img_res.)
dtype = 'float32' # -- this sets the working precision for data and model
+n_examples = 10000 # -- use up to 50000 examples
+
+data_view = mnist.views.OfficialImageClassification(x_dtype=dtype)
+x_as_images = data_view.train.x[:n_examples]
+x_img_res = x_as_images.shape[1:3] # -- handy for showing filters
+x = x_as_images.reshape((n_examples, -1)) # -- rasterize images to vectors
+n_visible = x.shape[1]
+
+# -- uncomment this line to see the initial weight values
+show_filters(x[:100], x_img_res, (10, 10))
+Generally speaking, an autoencoder maps vector input $\mathbf{x}$ to some intermediate representation $\mathbf{y}$ and then back into the original space, in such a way that the end-point is close to $\mathbf{x}$. While the goal thus defined is simply to learn an identity function, +things get interesting when the mappings are parameterized or constrained in such a way that a general identity function is either impossible to represent or at least difficult to learn from data. When this is the case, the goal of learning is to learn a special-purpose identity function that typically works for vectors $\mathbf{x}$ that we care about, which come from some empirically interesting distribution. The $\mathbf{y}$ vector that comes out of this process contains all the important information about $x$ in a new and potentially useful way.
+In our tutorial here we will deal with vectors $\mathbf{x}$ that come from the MNIST data set of hand-written digits. +Examples from MNIST are vectors $\mathbf{x} \in [0,1]^N$, +and we will look at the classic one-layer sigmoidal autoencoder parameterized by:
+which encodes vectors $\mathbf{x}$ into $\mathbf{y} \in [0,1]^M$ by the deterministic mapping
+$$ \mathbf{h} = s(\mathbf{x}\mathbf{W} + \mathbf{b}) $$
+Where $s$ is a poinwise sigmoidal function $s(u) = 1 / (1 + e^{-u})$. +The latent representation $\mathbf{h}$, +or code is then mapped back (decoded) into +reconstruction $\mathbf{z}$ through the similar transformation
+$$\mathbf{z} = s(\mathbf{h}\mathbf{V} + \mathbf{c}) $$.
+# Run this cell to define the symbols
+def logistic(u):
+ """Return logistic sigmoid of float or ndarray `u`"""
+ return 1.0 / (1.0 + exp(-u))
+
+def autoencoder(W, b, V, c, x):
+ h = logistic(dot(x, W) + b)
+ # -- using w.T here is called using "tied weights"
+ # -- using a second weight matrix here is called "untied weights"
+ z = logistic(dot(h, V) + c)
+ return z, h
+
+print '-> defined model family'
+Training an autoencoder consists of minimizing a reconstruction cost, +taking $\mathbf{z}$ as the reconstruction of $\mathbf{x}$. +Intuitively, we want a model that reconstructs the important aspects of $\mathbf{x}$ +so that we guarantee that these aspects are preserved by $\mathbf{y}$. +Typically, mathematical formalizations of this intuition are not particularly +exotic, but they should, at the very least, respect the domain and range of $\mathbf{x}$ +and $\mathbf{z}$ respectively. +So for example, if $\mathbf{x}$ were a real-valued element of $\mathbb{R}^N$ then +we might formalize the reconstruction cost as +$|| \mathbf{x} - \mathbf{z} ||^2$, but there is no mathematical requirement +to use the $\ell_2$ norm or indeed to use a valid norm at all.
+For our MNIST data, we will suppose that $\mathbf{x}$ is a vector Bernoulli probabilities, +and define the reconstruction cost to be the cross-entropy betwen $\mathbf{z}$ and $\mathbf{x}$:
+$$ +L(\mathbf{x}, \mathbf{z}) = - \sum^d_{k=1}[\mathbf{x}_k \log + \mathbf{z}_k + (1 - \mathbf{x}_k)\log(1 - \mathbf{z}_k)] +$$
+We write this in Python as:
+def cross_entropy(x, z):
+ # -- N.B. this is numerically bad, we're counting on Theano to fix up
+ return -(x * log(z) + (1 - x) * log(1 - z)).sum(axis=1)
+
+
+def training_criterion(W, b, V, c, x):
+ z, h = autoencoder(W, b, V, c, x)
+ L = cross_entropy(x, z).mean()
+ return L
+
+print '-> defined training_criterion'
+Training an autoencoder is an optimization problem that is +typically treated in two steps:
+Initialization +The parameters of this particular model +($W, \mathbf{b}, V, \mathbf{c}$) must be initialized +so that symmetry is broken between the columns of $W$ (also the rows of $V$), +otherwise the gradient on each column will be identical and local search will +be completely ineffective.
+It is customary to initialize $W$ to small random values to break symmetry, +and to initialize the rest of the parameters to 0.
+# -- allocate and initialize a new model (w, visbias, hidbias)
+
+# -- tip: choose the number of hidden units as a pair, so that show_filters works
+n_hidden2 = (10, 10)
+n_hidden = np.prod(n_hidden2)
+W = rng.uniform(
+ low=-4 * np.sqrt(6. / (n_hidden + n_visible)),
+ high=4 * np.sqrt(6. / (n_hidden + n_visible)),
+ size=(n_visible, n_hidden)).astype(dtype)
+V = np.zeros_like(W.T)
+b = np.zeros(n_hidden).astype(dtype)
+c = np.zeros(n_visible).astype(dtype)
+
+# -- show the initial (random) weight values as filters
+show_filters(W.T, x_img_res, n_hidden2)
+Local Search
+The training criterion $L$ defined above is a deterministic and differentiable function of parameters ($W, b, V, c$) so any multi-dimensional minimization algorithm can do the job. +The speed and accuracy of a the method generally depends on the size nature of the data set, the parameters and parameterization of the model, and of course the sophistication of the minimization algorithm.
+A classic, and still widely useful, algorithm is stochastic gradient descent.
+t0 = time.time()
+online_batch_size = 1 # -- =1 for "pure online", >1 for "minibatch"
+streams={'x': x.reshape((
+ n_examples / online_batch_size,
+ online_batch_size,
+ n_visible))}
+W, b, V, c = autodiff.fmin_sgd(training_criterion,
+ args=(W, b, V, c),
+ streams=streams,
+ stepsize=0.001,
+ loops=1, # -- fmin_sgd can iterate over the streams repeatedly
+ print_interval=1000,
+ )
+
+show_filters(W.T, x_img_res, n_hidden2)
+print 'Online training took %f seconds' % (time.time() - t0)
+Another powerful minimization method is the well-known L-BFGS algorithm. When used as a training algorithm, it is called a batch algorithm because it does not deal well with stochastic functions, and therefore requires that we evaluate the total loss over all training examples on each cost-function evaluation.
+L-BFGS can be run right away from the random initialization,
+but often what works better is to do a few loops of fmin_sgd to move the initial random parameters to a promising area of the search space, and then to refine that solution with L-BFGS.
# -- BATCH TRAINING
+def batch_criterion(W, b, V, c):
+ return training_criterion(W, b, V, c, x)
+
+W, b, V, c = autodiff.fmin_l_bfgs_b(batch_criterion,
+ args=(W, b, V, c),
+ maxfun=20, # -- how many function calls to allow?
+ iprint=1, # -- 1 for verbose, 0 for normal, -1 for quiet
+ m=20) # -- how well to approximate the Hessian
+
+show_filters(W.T, x_img_res, n_hidden2)
+That's it, we've trained an auto-encoder.
+An autoencoder is a function that maps a vector to approximately itself, by passing it through intermediate values in such a way that representing the pure identity function is difficult or impossible. An autoencoder is a non-stochastic precursor to several more modern probabilistic models such as De-Noising AutoEncoders, Sparse Coding, and Restricted Boltzmann Machines. They are a flexible class of feature extraction algorithms that can be quick to train by either stochastic gradient descent, or more sophisticated minimization algorithms such as L-BFGS.
+Spend some time running through the exercises below to get a better feel for how autoencoders work and what they can do.
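Before moving on, it can be instructive to push a few training examples through the trained model and compare them with their reconstructions. The following is a small sketch that only re-uses functions and variables defined earlier in this notebook:

# -- encode and decode the first 100 training examples with the trained parameters
z100, h100 = autoencoder(W, b, V, c, x[:100])
print 'mean reconstruction cross-entropy:', cross_entropy(x[:100], z100).mean()

# -- compare with the originals shown at the top of the notebook
show_filters(z100, x_img_res, (10, 10))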
+If there is one linear hidden layer (the code) and +the mean squared error criterion is used to train the network, then the $k$ +hidden units learn to project the input in the span of the first $k$ +principal components of the data. If the hidden +layer is non-linear, the auto-encoder behaves differently from PCA, +with the ability to capture multi-modal aspects of the input +distribution. The departure from PCA becomes even more important when +we consider stacking multiple encoders (and their corresponding decoders) +when building a deep auto-encoder [Hinton06]_.
+Re-inialize the weights, and try SGD on the following linear auto-encoder model.
+HINT: if you get NaN as the Value during SGD, try lowering the step size to 0, and then slowly raising it.
+def squared_error(x, x_rec):
+ """Return the squared error of approximating `x` with `x_rec`"""
+ return ((x - x_rec) ** 2).sum(axis=1)
+
+def linear_autoencoder_criterion(W, c, x):
+ hid = dot(x - c, W)
+ x_rec = dot(hid, W.T)
+ cost = squared_error(x - c, x_rec)
+ return cost.mean()
+
+print linear_autoencoder_criterion(W, c, x[:3])
+
+# Modify the SGD training code (or paste it here) to train just W and c
+# using the linear_autoencoder_criterion
+# ...
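For reference, one way to fill in this fragment looks like the following sketch. It re-uses the streams dictionary and the fmin_sgd call pattern from the SGD cell above; W_lin and c_lin are fresh names (so the sigmoidal model's parameters are not clobbered), and the step size is a guess you may need to tune:

W_lin = rng.uniform(low=-0.01, high=0.01,
                    size=(n_visible, n_hidden)).astype(dtype)
c_lin = np.zeros(n_visible).astype(dtype)

W_lin, c_lin = autodiff.fmin_sgd(linear_autoencoder_criterion,
                                 args=(W_lin, c_lin),
                                 streams=streams,
                                 stepsize=0.0001,
                                 loops=1,
                                 print_interval=1000,
                                 )
show_filters(W_lin.T, x_img_res, n_hidden2)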
+The weight matrix $V$ of the reverse mapping may be
+optionally constrained by $V = W^{T}$, which is
+called an autoencoder with tied weights.
+This is another way to regularize the autoencoder model.
+The PCA interpretation of the autoencoder (linear autoencoder), the RBM model (algorithmically similar to the autoencoder), the K-Means, and sparse coding interpretations of the auto-encoder all demand tied weights. How do the filters come out if you use the training_criterion_tied?
def training_criterion_tied(W, b, c, x):
+ return training_criterion(W, b, W.T, c, x)
+
+# Now modify (or paste here) an optimization fragment that optimizes only (W, b, c)
+# ...
+# How is W changed by being tied to V ?
+One serious potential issue with auto-encoders is that if there is no other +constraint besides minimizing the reconstruction error, +then an auto-encoder with $n$ inputs and an +encoding of dimension at least $n$ could potentially just learn +the identity function, and fail to differentiate +test examples (from the training distribution) from other input configurations. +Surprisingly, experiments reported in [Bengio07]_ nonetheless +suggest that in practice, when trained with +stochastic gradient descent, non-linear auto-encoders with more hidden units +than inputs (called overcomplete) yield useful representations +(in the sense of classification error measured on a network taking this +representation in input). A simple explanation is based on the +observation that stochastic gradient +descent with early stopping is similar to an L2 regularization of the +parameters. To achieve perfect reconstruction of continuous +inputs, a one-hidden layer auto-encoder with non-linear hidden units +(exactly like in the above code) +needs very small weights in the first (encoding) layer (to bring the non-linearity of +the hidden units in their linear regime) and very large weights in the +second (decoding) layer. +With binary inputs, very large weights are +also needed to completely minimize the reconstruction error. Since the +implicit or explicit regularization makes it difficult to reach +large-weight solutions, the optimization algorithm finds encodings which +only work well for examples similar to those in the training set, which is +what we want. It means that the representation is exploiting statistical +regularities present in the training set, rather than learning to +replicate the identity function.
+# The claim above is that SGD already implements an L2 regularization implicitly,
+# whereas L-BFGS less so or not at all.
+# Can you detect a difference? (I'm not sure if you can or not)
+#
+# One way to explore the question is to let SGD and L-BFGS both run until the reconstruction
+# error is nearly zero. Can both algorithms get that far? Which one finds larger parameters?
+#
+# Now define a new training_criterion function that includes L2 penalty
+# (W ** 2).sum()
+#
+# How does an explicit L2 penalty affect SGD? What about L-BFGS?
+There are different ways that an auto-encoder with more hidden units +than inputs could be prevented from learning the identity, and still +capture something useful about the input in its hidden representation. +One is the addition of sparsity (forcing many of the hidden units to +be zero or near-zero), and it has been exploited very successfully +by many [Ranzato07] [Lee08].
+One of the most extreme forms of sparsity in autoencoders is known as the K-means algorithm.
+In the K-means model, the latent code hid is a vector with one 1 and the rest all 0.
+K-means is typically solved using a very efficient batch EM algorithm, but we can also
+tackle it with the same tools we've been using thus far.
How does this compare with sklearn's K-means solver?
+def softmax(x):
+ """Return the softmax of each row in x"""
+ x2 = x - x.max(axis=1)[:, newaxis]
+ ex = exp(x2)
+ return ex / ex.sum(axis=1)[:, newaxis]
+
+# helper functions -- run this once after a kernel restart
+# Re-run it after any change you make to these routines.
+
+def euclidean_distances2(X, Y):
+ """Return all-pairs squared distances between rows of X and Y
+ """
+ # N.B. sklearn.metrics.pairwise.euclidean_distances
+ # offers a more robust version of this routine,
+ # but which does things that autodiff currently does not support.
+ XX = np.sum(X * X, axis=1)[:, newaxis]
+ YY = np.sum(Y * Y, axis=1)[newaxis, :]
+ distances = np.maximum(XX - 2 * dot(X, Y.T) + YY, 0)
+ return distances
+
+def k_means_real_x(w, x):
+ xw = euclidean_distances2(x, w.T)
+ # -- This calculates a hard winner
+ hid = (xw == xw.min(axis=1)[:, newaxis]).astype(dtype)
+ # -- This calculates a soft winner (negate the distances so that nearer centroids get larger weights)
+ hid = hid + softmax(-xw)
+ x_rec = dot(hid/2, w.T)
+ cost = ((x - x_rec) ** 2).sum(axis=1)
+ rval = cost.mean()
+ return rval
+
+if 1:
+ W, = autodiff.fmin_sgd(k_means_real_x,
+ args=(W,),
+ streams=streams,
+ stepsize=0.001,
+ loops=2, # -- fmin_sgd can iterate over the streams repeatedly
+ print_interval=1000,
+ )
+
+if 0:
+ W, = autodiff.fmin_l_bfgs_b(lambda W: k_means_real_x(W, x),
+ args=(W,),
+ maxfun=60, # -- how many function calls to allow?
+ iprint=1, # -- 1 for verbose, 0 for normal, -1 for quiet
+ m=20) # -- how well to approximate the Hessian
+
+
+show_filters(W.T, x_img_res, n_hidden2)
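For the comparison suggested above, scikit-learn's K-means can be run on the same data in a few lines (a sketch; it assumes scikit-learn is installed, and in very old versions the n_clusters keyword was spelled k):

from sklearn.cluster import KMeans

km = KMeans(n_clusters=n_hidden)
km.fit(x)
show_filters(km.cluster_centers_.astype(dtype), x_img_res, n_hidden2)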
+Sorry guys, this isn't implemented yet. +So the exercise is: implement FISTA, or the (Olshausen and Field, 1996) version.
+FISTA = NotImplementedError
+# real-real Sparse Coding
+def sparse_coding_real_x(x, w, hidbias, visbias, sparse_coding_algo=FISTA):
+ # -- several sparse coding algorithms have been proposed, but they all
+ # give rise to a feature learning algorithm that looks like this:
+ hid = sparse_coding_algo(x, w)
+ x_rec = dot(hid, w.T) + visbias
+ cost = ((x - x_rec) ** 2).mean(axis=1)
+ # -- the gradient on this cost wrt `w` through the sparse_coding_algo is
+ # often ignored. At least one notable exception is the work of Karol
+ # Gregor. I feel like the Implicit Differentiation work of Drew Bagnell
+ # is another, but I'm not sure.
+ return cost, hid
+Another variation on the autoencoder theme is to add noise to the input by zeroing out some of the elements randomly. This is known as the denoising autoencoder (Vincent et al., XXX) +and it has been shown to be a form of score-matching (probabilistic model interpretation) +and a good feature extraction algorithm.
+def denoising_autoencoder_binary_x(W, b, c, x):
+ # -- corrupt the input by zero-ing out some values randomly
+ noisy_x = x * (rand(*x.shape) > .3)
+ hid = logistic(dot(noisy_x, W) + b)
+ x_rec = logistic(dot(hid, W.T) + c)
+ cost = cross_entropy(x, x_rec)
+ return cost.mean()
+
+W, b, c = autodiff.fmin_sgd(denoising_autoencoder_binary_x,
+ args=(W, b, c),
+ streams=streams,
+ stepsize=0.01,
+ loops=1, # -- fmin_sgd can iterate over the streams repeatedly
+ print_interval=1000,
+ )
+
+show_filters(W.T, x_img_res, n_hidden2)
+The Contrastive Divergence algorithm for training Restricted Boltzmann Machines (RBMs) looks a lot like an auto-encoder from an algorithmic perspective, especially the denoising autoencoder. Tempting though it is to use the autoencoder as a unifying framework, it is a stretch to call an RBM an auto-encoder. Mathematically, it is better to train an RBM by Persistent CD (aka Stochastic Maximum Likelihood) and these methods actually work on the basis of ensuring that the RBM does not perfectly reconstruct any training instances. Nevertheless, +the following code implements the CD algorithm in a way that makes a clear connection to the other autoencoder-style models above.
+One important difference with the cases above is that +CD is not in fact gradient descent on a cost function and although this code makes use of a difference in free energies as a cost function, it is just a trick. +Consequently, you'll see that the so-called "cost" values regularly jump both up and down even when the algorithm is working perfectly well. Computation of the likelihood of an RBM involves an infamously intractable partition function -- there are techniques for estimating it (see Ruslan Salakhutdinov's PhD thesis), but no tutorial here yet.
+def rbm_binary_x(w, hidbias, visbias, x):
+ hid = logistic(dot(x, w) + hidbias)
+ hid_sample = (hid > rand(*hid.shape)).astype(x.dtype)
+
+ # -- N.B. model is not actually trained to reconstruct x
+ x_rec = logistic(dot(hid_sample, w.T) + visbias)
+ x_rec_sample = (x_rec > rand(*x_rec.shape)).astype(x.dtype)
+
+ # "negative phase" hidden unit expectation
+ hid_rec = logistic(dot(x_rec_sample, w) + hidbias)
+
+ def free_energy(xx):
+ xw_b = dot(xx, w) + hidbias
+ return -log(1 + exp(xw_b)).sum(axis=1) - dot(xx, visbias)
+
+ cost = free_energy(x) - free_energy(x_rec_sample)
+ return cost.mean()
+
+W, b, c = autodiff.fmin_sgd(rbm_binary_x,
+ args=(W, b, c),
+ streams=streams,
+ stepsize=0.01,
+ loops=5, # -- fmin_sgd can iterate over the streams repeatedly
+ print_interval=1000,
+ )
+
+show_filters(W.T, x_img_res, n_hidden2)
+This notebook has been adapted from the +Deep Learning Tutorials: Convolutional Neural Network
+Convolutional Neural Networks (ConvNets) are MLP variants which are specialized for image processing. From Hubel and Wiesel's early work on the cat's visual cortex [Hubel68], we know there exists a complex arrangement of cells within the visual cortex. These cells are sensitive to small sub-regions of the input space, called a receptive field, and are tiled in such a way as to cover the entire visual field. These filters are local in input space and are thus better suited to exploit the strong spatially local correlation present in natural images.
+Additionally, two basic cell types have been identified: simple cells (S) and +complex cells (C). Simple cells (S) respond maximally to specific edge-like +stimulus patterns within their receptive field. Complex cells (C) have larger +receptive fields and are locally invariant to the exact position of the +stimulus.
+import sys
+import time
+
+import numpy as np
+from numpy import arange, dot, maximum, ones, tanh, zeros
+from numpy.random import uniform
+
+from skdata import mnist
+import autodiff
+
+from util import show_filters
+from util import hinge
+from util import ova_svm_prediction, ova_svm_cost
+from util import tanh_layer
+from util import mlp_prediction, mlp_cost
+
+# -- filterbank normalized cross-correlation
+from util import fbncc
+
+# -- max-pooling over 2x2 windows
+from util import max_pool_2d_2x2
+# -- load and prepare the data set
+#
+# -- N.B. we're loading up x as images this time, not vectors
+dtype = 'float32' # helps save memory and go faster
+n_examples = 10000
+data_view = mnist.views.OfficialImageClassification(x_dtype=dtype)
+n_classes = 10
+x_rows, x_cols = data_view.train.x.shape[1:3]
+# -- N.B. we transpose the input to have shape
+# (#examples, #channels, #rows, #cols)
+x = data_view.train.x[:n_examples].transpose(0, 3, 1, 2)
+y = data_view.train.y[:n_examples]
+y1 = -1 * ones((len(y), n_classes)).astype(dtype)
+y1[arange(len(y)), y] = 1
+ConvNets (as well as several related computer vision architectures XXX) exploit spatially local correlation by enforcing a local connectivity pattern between neurons of adjacent layers. The hidden units in the m-th layer are connected to a local subset of units in the (m-1)-th layer, which have spatially contiguous receptive fields. We can illustrate this graphically as follows:
+
Imagine that layer m-1 is the input retina. In the above, units in layer m have receptive fields of width 3 with respect to the input retina and are thus only connected to 3 adjacent neurons in the layer below (the retina). Units in layer m+1 have a similar connectivity with the layer below. We say that their receptive field with respect to the layer below is also 3, but their receptive field with respect to the input is larger (it is 5). The architecture thus confines the learnt "filters" (corresponding to the input producing the strongest response) to be a spatially local pattern (since each unit is unresponsive to variations outside of its receptive field with respect to the retina). As shown above, stacking many such layers leads to "filters" (no longer linear) which become increasingly "global" (i.e. spanning a larger region of pixel space). For example, the unit in hidden layer m+1 can encode a non-linear feature of width 5 (in terms of pixel space).
+In ConvNets, each sparse filter $h_i$ is additionally replicated across the +entire visual field. These "replicated" units form a feature map, which +share the same parametrization, i.e. the same weight vector and the same bias.
+
In the above figure, we show 3 hidden units belonging to the same feature map. +Weights of the same color are shared, i.e. are constrained to be identical. +Gradient descent can still be used to learn such shared parameters, and +requires only a small change to the original algorithm. The gradient of a +shared weight is simply the sum of the gradients of the parameters being +shared.
+Why are shared weights interesting ? Replicating units in this way allows for +features to be detected regardless of their position in the visual field. +Additionally, weight sharing offers a very efficient way to do this, since it +greatly reduces the number of free parameters to learn. By controlling model +capacity, ConvNets tend to achieve better generalization on vision problems.
+Conceptually, a feature map is obtained by convolving the input image with a +linear filter, adding a bias term and then applying a non-linear function. If +we denote the k-th feature map at a given layer as $h^k$, whose filters +are determined by the weights $W^k$ and bias $b_k$, then the +feature map $h^k$ is obtained as follows (for $tanh$ non-linearities):
+$$ + h^k_{ij} = \tanh ( (W^k * x)_{ij} + b_k ). +$$
+To form a richer representation of the data, hidden layers are composed of +a set of multiple feature maps, ${h^{(k)}, k=0..K}$. +The weights $W$ of this layer can be parametrized as a 4D tensor +(destination feature map index, source feature map index, source vertical position index, source horizontal position index) +and +the biases $b$ as a vector (one element per destination feature map index). +We illustrate this graphically as follows:
+
Figure 1: example of a convolutional layer
+Here, we show two layers of a ConvNet, containing 4 feature maps at layer (m-1) +and 2 feature maps ($h^0$ and $h^1$) at layer m. Pixels (neuron outputs) in +$h^0$ and $h^1$ (outlined as blue and red squares) are computed +from pixels of layer (m-1) which fall within their 2x2 receptive field in the +layer below (shown +as colored rectangles). Notice how the receptive field spans all four input +feature maps. The weights $W^0$ and $W^1$ of $h^0$ and +$h^1$ are thus 3D weight tensors. The leading dimension indexes the +input feature maps, while the other two refer to the pixel coordinates.
+Putting it all together, $W^{kl}_{ij}$ denotes the weight connecting +each pixel of the k-th feature map at layer m, with the pixel at coordinates +(i,j) of the l-th feature map of layer (m-1).
+Filterbank normalized cross-correlation (fbncc) implements filterbank cross-correlation (convolution with flipped filters) with the additional step of local centering and normalization (sphere-ing) of the image patch. This routine provides a very high-pass non-linear filter which is used in place of pure convolution in most modern convnets (citations needed).
To get a sense of what fbncc does, let's have a little fun with this classic meme...
+from PIL import Image
+img = Image.open('images/3wolfmoon.jpg')
+img = np.asarray(img, dtype='float32') / 255.
+
+imshow(img)
+
+# put image in 4D tensor of shape (1, 3, height, width)
+img_ = img.swapaxes(0, 2).swapaxes(1, 2).reshape(1, 3, 639, 516)
+
+rfilters = uniform(size=(4, 3, 5, 5)).astype('float32')
+out = fbncc(img_, rfilters)
+figure()
+show_filters(out.reshape((4, -1)), out.shape[2:], (1, 4))
+Notice that a randomly initialized filter acts very much like an edge detector!
+ConvNets emulate the behaviour of "complex cells" in visual cortex by +max-pooling, which is a form of +nonlinear downsampling. Max-pooling partitions the input image into +a set of non-overlapping rectangles and, for each such sub-region, outputs the +maximum value. +Max-pooling is useful in vision for two reasons: (1) it reduces the +computational complexity for upper layers and (2) it provides a form of +translation invariance. To understand the invariance argument, imagine +cascading a max-pooling layer with a convolutional layer. There are 8 +directions in which one can translate the input image by a single pixel. If +max-pooling is done over a 2x2 region, 3 out of these 8 possible +configurations will produce exactly the same output at the convolutional +layer. For max-pooling over a 3x3 window, this jumps to 5/8.
+Since it provides additional robustness to position, max-pooling is thus a +"smart" way of reducing the dimensionality of intermediate representations.
+Max-pooling over the standard 2x2 window is implemented by util.max_pool_2d_2x2.
+It actually uses Theano's implementation of max-pooling internally, there is no native numpy implementation of max-pooling.
# -- similar to tanh_layer, but we used fbncc instead of dot
+def tanh_conv_layer(W_fb, b_fb, img4):
+ activation = fbncc(img4, W_fb) + b_fb
+ activation = max_pool_2d_2x2(activation)
+ return np.tanh(activation)
+A ConvNet architecture interleaves convolutional and pooling layers to produce +the input to an MLP.
+
From an implementation point of view, this means lower-layers operate on 4D +tensors. These are then flattened to a 2D matrix of rasterized feature maps, +to be compatible with our previous MLP implementation.
+# -- allocate just one convolutional layer to keep the code simpler
+n_filters = 16
+patch_height = 5
+patch_width = 5
+patch_size = patch_height * patch_width
+
+W_fb = uniform(
+ low=-np.sqrt(6.0 / (patch_size + n_filters)),
+ high=np.sqrt(6.0 / (patch_size + n_filters)),
+ size=(n_filters, 1, patch_height, patch_width),
+ ).astype(x.dtype)
+b_fb = zeros((
+ n_filters,
+ x_rows - patch_height + 1,
+ x_cols - patch_width + 1),
+ dtype=x.dtype)
+
+# -- allocate tanh layer parameters
+n_hidden = 100
+
+V = uniform(low=-np.sqrt(6.0 / (b_fb.size // 4 + n_hidden)),
+ high=np.sqrt(6.0 / (b_fb.size // 4 + n_hidden)),
+ size=(b_fb.size // 4, n_hidden)).astype(x.dtype)
+c = zeros(n_hidden, dtype=x.dtype)
+
+# -- allocate the SVM at the top
+W = zeros((n_hidden, n_classes), dtype=x.dtype)
+b = zeros(n_classes, dtype=x.dtype)
+
+# -- define the ConvNet model
+def convnet_features(W_fb, b_fb, V, c, x):
+ layer1 = tanh_conv_layer(W_fb, b_fb, x)
+ layer1_size = np.prod(layer1.shape[1:])
+ layer2 = tanh_layer(V, c,
+ np.reshape(layer1, (x.shape[0], layer1_size)))
+ return layer2
+
+def convnet_prediction(W_fb, b_fb, V, c, W, b, x):
+ layer2 = convnet_features(W_fb, b_fb, V, c, x)
+ return ova_svm_prediction(W, b, layer2)
+
+def convnet_cost(W_fb, b_fb, V, c, W, b, x, y1):
+ layer2 = convnet_features(W_fb, b_fb, V, c, x)
+ return ova_svm_cost(W, b, layer2, y1)
+
+print 'Initial ConvNet cost:', convnet_cost(W_fb, b_fb, V, c, W, b, x[:3], y1[:3])
+Also of note, remark that we use the same weight initialization formula as +with the MLP. Weights are sampled randomly from a uniform distribution in the +range [-1/$D_0$, 1/$D_0$], where $D_0$ is the number of inputs to a hidden +unit. For MLPs, this was the number of units in the layer below. For ConvNets +however, we have to take into account the number of input feature maps and the +size of the receptive fields.
+Training of a ConvNet by supervised learning is done with exactly the same techniques as we used for the linear SVM and the MLP. We will look at semi-unsupervised training strategies in upcoming tutorials.
+# SGD minimization
+
+# -- do n_online_loops passes through the data set doing SGD
+# This can be faster at the beginning than L-BFGS
+t0 = time.time()
+online_batch_size = 1
+n_batches = n_examples / online_batch_size
+W_fb, b_fb, V, c, W, b = autodiff.fmin_sgd(convnet_cost, (W_fb, b_fb, V, c, W, b),
+ streams={
+ 'x': x.reshape((n_batches, online_batch_size,) + x.shape[1:]),
+ 'y1': y1.reshape((n_batches, online_batch_size, y1.shape[1]))},
+ loops=2,
+ stepsize=0.01,
+ print_interval=1000,
+ )
+print 'SGD took %.2f seconds' % (time.time() - t0)
+show_filters(W_fb.reshape(n_filters, 5 * 5).T, (4, 4), (2, 8))
+Testing of a trained model is also done in much the same way we tested the SVM and MLP models.
+train_predictions = convnet_prediction(W_fb, b_fb, V, c, W, b, x)
+train_errors = y != train_predictions
+print 'Current train set error rate', np.mean(train_errors)
+
+test_predictions = convnet_prediction(W_fb, b_fb, V, c, W, b, data_view.test.x[:].transpose(0, 3, 1, 2))
+test_errors = data_view.test.y[:] != test_predictions
+print 'Current test set error rate', np.mean(test_errors)
+The current hyperparameters of the convnet are not great, and the test set error is around 4%. This model should be able to score around 1% on the test set. Try adjusting hyperparameters such as:
+How all these things affect training and test error?
+Feel free to send a pull-request back to the ipam-tutorials github page if you find good hyperparameter values!
+ConvNets in the literature have often been designed with not one, but two convolutional layers. Modify the convnet_features function to use two convolutional layers. Caveat: this will involve a few annoying reshapes, resizes, etc.
+In 1_1_getting started, we looked at three data sets in skdata: MNIST, CIFAR-10, and SVHN. The sample code above uses MNIST - try one of the other data sets and adapt the code here to run. You'll need to change the allocation of the first-layer filters to deal with the color images, and the size of the b_fb biases to match the larger input image dimensions, but I think that's it ...
ConvNets are especially tricky to train, as they add even more hyper-parameters than +a standard MLP. While the usual rules of thumb for learning rates andregularization constants still apply, the following should be kept in mind when +optimizing ConvNets.
+When choosing the number of filters per layer, keep in mind that computing the +activations of a single convolutional filter is much more expensive than with +traditional MLPs!
+Assume layer $(l-1)$ contains $K^{l-1}$ feature +maps and $M \times N$ pixel positions (i.e., +number of positions times number of feature maps), +and there are $K^l$ filters at layer $l$ of shape $m \times n$. +Then computing a feature map (applying an $m \times n$ filter +at all $(M-m) \times (N-n)$ pixel positions where the +filter can be applied) costs $(M-m) \times (N-n) \times m \times n \times K^{l-1}$. +The total cost is $K^l$ times that. Things may be more complicated if +not all features at one level are connected to all features at the previous one.
+For a standard MLP, the cost would only be $K^l \times K^{l-1}$ +where there are $K^l$ different neurons at level $l$. +As such, the number of filters used in ConvNets is typically much +smaller than the number of hidden units in MLPs and depends on the size of the +feature maps (itself a function of input image size and filter shapes).
+Since feature map size decreases with depth, layers near the input layer will tend to +have fewer filters while layers higher up can have much more. In fact, to +equalize computation at each layer, the product of the number of features +and the number of pixel positions is typically picked to be roughly constant +across layers. To preserve the information about the input would require +keeping the total number of activations (number of feature maps times +number of pixel positions) to be non-decreasing from one layer to the next +(of course we could hope to get away with less when we are doing supervised +learning). The number of feature maps directly controls capacity and so +that depends on the number of available examples and the complexity of +the task.
+Common filter shapes found in the litterature vary greatly, usually based on +the dataset. Best results on MNIST-sized images (28x28) are usually in the 5x5range on the first layer, while natural image datasets (often with hundreds of pixels in each +dimension) tend to use larger first-layer filters of shape 12x12 or 15x15.
+The trick is thus to find the right level of "granularity" (i.e. filter +shapes) in order to create abstractions at the proper scale, given a +particular dataset.
+Typical values are 2x2 or no max-pooling. Very large input images may warrant +4x4 pooling in the lower-layers. Keep in mind however, that this will reduce the +dimension of the signal by a factor of 16, and may result in throwing away too +much information.
+Some quick hacking with Terry Stewart produced a Theano implementation of LIF neurons. He has added a more library-oriented version of this approach into Nengo, but this hacky code is probably still easier to read and play with.
+# -- Initial imports
+import numpy as np
+import theano
+import theano.tensor as TT
+from theano import shared, function
+# -- Set up our simulation to run with a particular precision
+
+floatX = theano.config.floatX # either 'float32' or 'float64'
+
+def sharedX(obj, name):
+ """Create shared variable for ndarray'd obj with dtype=floatX
+ """
+ return shared(np.asarray(obj, dtype=floatX))
+# -- Assign simulation parameters
+# Written in this way these will be *compiled in* to the
+# tick() function below, as Theano *constants*.
+
+N = 100 # -- number of neurons
+D = 3 # -- state vector dimensionality
+
+t_rc = 0.01 # -- some capacitance constant
+t_ref = 0.001 # -- refractory period (1/max spike rate?)
+dt = 0.001 # -- simulation time-step
+pstc = 0.05 # -- post-synaptic time constant (low-pass filter on spikes in state update)
+
+encoders = np.random.randn(N, D).astype(floatX)
+decoders = np.random.randn(N, D).astype(floatX)
+# -- Define the recurrence of the LIF model
+
+# -- Place persistent quantities into shared variables
+state = sharedX(np.zeros(D), 'state')
+voltage = sharedX(np.zeros(N), 'voltage')
+refractory_time = sharedX(np.zeros(N), 'refractory_time')
+J_bias = sharedX(np.random.randn(N), 'J_bias')
+Jm_prev = sharedX(np.zeros(N), 'Jm_prev')
+
+state_w_bias = TT.set_subtensor(state[0], 0.5)
+
+Jm = TT.dot(encoders, state) + J_bias
+
+# -- Euler's method
+dV = dt / t_rc * (Jm_prev - voltage)
+new_v = TT.maximum(voltage + dV, 0)
+
+post_ref = 1.0 - (refractory_time - dt) / dt
+
+# -- Do accurate timing for when the refractory period ends (probably not needed for FPGA neuron model)
+new_v *= TT.clip(post_ref, 0, 1)
+
+V_threshold = 1
+
+spiked = TT.switch(new_v > V_threshold, 1.0, 0.0)
+
+# -- Refractory stuff
+overshoot = (new_v - V_threshold) / dV
+spiketime = dt * (1.0 - overshoot)
+new_refractory_time = TT.switch(spiked,
+ spiketime + t_ref,
+ refractory_time - dt)
+
+# -- low-pass filter on state
+decay = float(np.exp(-dt / pstc))
+new_state = (state * decay
+ + (1 - decay) * TT.dot(spiked, decoders) / dt)
+# -- Compile a function to evaluate one time-step of the model
+
+tick = function([], [],
+ updates={
+ voltage: new_v * (1 - spiked),
+ state: new_state,
+ refractory_time: new_refractory_time,
+ Jm_prev: Jm})
+# -- Step through time, retrieving the voltage value.
+voltages = []
+for i in range(1000):
+ tick()
+ voltages.append(voltage.get_value())
+lines = plot(np.asarray(voltages).T)
+The linear support vector machine (SVM) is a linear classifier parametrized by a matrix of weights $W$ and a bias vector $\mathbf{b}$. We will develop the multiclass "One vs. All" linear support vector machine, which is a concatentation of two-class support vector machines. We will suppose that our label set is the non-negative integers up to some maximum L.
+There are other ways of setting up a multi-class SVM, such using the Crammer-Singer loss or +LaRank. +Despite theoretical limitations (lack of consistency!?) the approach is widely used. +See John Langford's page on machine learning reductions +for more perspectives on multi-class SVMs.
+Deep architectures for classification are typically formed by classifying learned features with a +linear classifier such as an SVM or logistic regression model. +We will begin our exploration of deep architectures with the shallowest model of all: the linear SVM.
+The Deep Learning Tutorials provide a good complementary introduction to +logistic regression.
+# initialize the workspace by importing several symbols
+import logging
+import sys
+import time
+
+import numpy as np
+from numpy import arange, dot, maximum, ones, tanh, zeros
+from numpy.random import randn
+
+from skdata import mnist
+import autodiff
+
+try:
+ from util import show_filters
+except:
+ print 'WARNING: show_filters will not work'
+ def show_filters(*a, **kw): pass
+# -- load and prepare the data set (even download if necessary)
+dtype = 'float32'
+n_examples = 50000
+n_classes = 10 # -- denoted L in the math expressions
+img_shape = (28, 28)
+
+data_view = mnist.views.OfficialVectorClassification(x_dtype=dtype)
+x = data_view.all_vectors[data_view.fit_idxs[:n_examples]]
+y = data_view.all_labels[data_view.fit_idxs[:n_examples]]
+Supposing that our input is a vector (generally a feature vector) $\mathbf{x}$, the prediction $l^*$ of the OVA-SVM is the argmax of the affine function:
+$$ +l^ = \operatorname{argmax}_{l=0}^{L} (\mathbf{x} W_l + \mathbf{b}_l) +$$
+def ova_svm_prediction(W, b, x):
+ return np.argmax(np.dot(x, W) + b, axis=1)
+The most standard way of training an SVM is to maximize the margin on training data. +The margin is always defined in the two-class case (where $y \in \{-1, +1\}$) +as $y * (xW + b)$. +The affine function parametrized by $W$ and $b$ defines a hyper-plane where $\mathbf{x}W + b = 0$ that the SVM interprets as a decision surface. +The margin represents how far away $\mathbf{x}W+b$ is from being on the wrong side of the decision surface: large positive values mean there is a safe distance ("margin of error?"), negative values mean that the decision surface is actually incorrect for the given $\mathbf{x}$.
+The most standard sense in which SVMs maximize margin is via the hinge loss. The hinge loss of margin value $u$ is defined as
+$$ +\mathrm{hinge}(u) = \max(0, 1-u) +$$
+By maximizing the average hinge loss of the margin of training examples, we maximize a tight convex upper bound on the mis-classification rate (zero-one loss), and can +get a good binary classifier using fast algorithms for convex optimization.
+def hinge(u):
+ return np.maximum(0, 1 - u)
+
+ugrid = np.arange(-5, 5, .1)
+plot(ugrid, hinge(ugrid), label='hinge loss')
+plot(ugrid, ugrid < 0, label='zero-one loss')
+legend()
+In the multiclass case, it is not clear what the "correct" margin definition should be, and several useful ones have been proposed. +For the rest of this tutorial we'll develop the "one vs. all" multiclass SVM, sometimes called an OVA-SVM.
+In the OVA-SVM, the training objective is defined by $L$ independent SVMs. +We will train an OVA-SVM by converting the integer-valued labels +$y$ to $Y \in \{-1, +1\}^{N \times L}$ and training $L$ SVMs at once. +The $L$ columns of $Y$ represent binary classification tasks. +The $L$ columns of $W$ and $L$ elements of $\mathbf{b}$ store the parameters of the $L$ SVMs.
+The [unregularized] training objective is:
+$$ +\mathcal{L}(\mathcal{D}; W, \mathbf{b}) = +\frac{1}{N} +\sum_{(\mathbf{x}^{(i)}, Y^{(i)}) \in \mathcal{D}} +~ +\sum_{l=1}^{L} +~ +\max \left( 0, 1 - Y^{(i)}_l (\mathbf{x}^{(i)} W_l + \mathbf{b}_l) \right) +$$
+# -- prepare a "1-hot" version of the labels, denoted Y in the math
+y1 = -1 * ones((len(y), n_classes)).astype(dtype)
+y1[arange(len(y)), y] = 1
+def ova_svm_cost(W, b, x, y1):
+ # -- one vs. all linear SVM loss
+ margin = y1 * (dot(x, W) + b)
+ cost = hinge(margin).mean(axis=0).sum()
+ return cost
+The training itself consists in minimizing the objective $\mathcal{L}(\mathcal{D}; W, b)$ with respect to $W$ and $b$. The criterion is convex, so it doesn't much matter where we start. Zero works well in practice.
+# initialize the model
+W = zeros((x.shape[1], n_classes), dtype=dtype)
+b = zeros(n_classes, dtype=dtype)
+There are many specialized SVM solvers available, but they tend to be most helpful for non-linear SVMs. In the linear case a simple combination of stochastic gradient descent (SGD) and BFGS can be very effective.
+# -- do n_online_loops passes through the data set doing SGD
+# This can be faster at the beginning than L-BFGS
+t0 = time.time()
+online_batch_size = 1
+n_online_epochs = 1
+n_batches = n_examples / online_batch_size
+W, b = autodiff.fmin_sgd(ova_svm_cost, (W, b),
+ streams={
+ 'x': x.reshape((n_batches, online_batch_size, x.shape[1])),
+ 'y1': y1.reshape((n_batches, online_batch_size, y1.shape[1]))},
+ loops=n_online_epochs,
+ step_size=0.001,
+ print_interval=1000,
+ )
+print 'SGD took %.2f seconds' % (time.time() - t0)
+show_filters(W.T, img_shape, (2, 5))
+# -- L-BFGS optimization of our SVM cost.
+
+def batch_criterion(W, b):
+ return ova_svm_cost(W, b, x, y1)
+
+W, b = autodiff.fmin_l_bfgs_b(batch_criterion, (W, b), maxfun=20, m=20, iprint=1)
+
+print 'final_cost', batch_criterion(W, b)
+# -- N. B. the output from this command comes from Fortran, so iPython does not see it.
+# To monitor progress, look at the terminal from which you launched ipython
+show_filters(W.T, img_shape, (2, 5))
+Once the a classifier has been trained, we can test it for generalization accuracy on the test set. We test it by making predictions for the examples in the test set and counting up the number of classification mistakes.
+train_predictions = ova_svm_prediction(W, b, x)
+train_errors = y != train_predictions
+print 'Current train set error rate', np.mean(train_errors)
+
+test_predictions = ova_svm_prediction(W, b, data_view.all_vectors[data_view.tst_idxs])
+test_errors = data_view.all_labels[data_view.tst_idxs] != test_predictions
+print 'Current test set error rate', np.mean(test_errors)
+The linear Support Vector Machine (SVM) is an accurate and quick-to-train linear classifier. +We looked a multiclass variant called a "One vs. All" SVM, +parameterized by a matrix $W$ of weights and a vector $\mathbf{b}$ of biases. +A linear SVM can be trained by gradient descent; +whether SGD or L-BFGS works faster depends on the size of the data set and accuracy desired from the minimization process. A few iterations of SGD followed by refinement by L-BFGS is a fairly robust strategy.
+Typically linear SVMs are L2-regularized. That means that in addition to $\mathcal{L}$, we minimize the sum of the squared elements of $W$:
+$$ +\mathcal{L}_{\ell_2}(\mathcal{D}; W, b) = \mathcal{L}(\mathcal{D}; W, b) + \alpha ||W||_2^2 +$$
+def ova_svm_cost_l2(W, b, x):
+ raise NotImplementedError('implement me')
+
+# re-run the SGD and LBFGS training fragments after filling
+# in this function body to implement an L2-regularized SVM.
+
+# In cases where the training data are linearly separable, this can be very important.
+# How big does \alpha have to be to make any difference?
+Although the optimization of an SVM is a convex problem, the efficiency and stopping criteria of both SGD and L-BFGS are generally improved by centering and normalizing each input feature beforehand. The procedure is simple: subtract off the training set mean of each feature and divide by the standard deviation.
+How does this affect the behavior of SGD and L-BFGS?
+How should you deal with features whose standard deviation is very small? What about when it is 0?
+We have been using MNIST so far. How good is a linear classifier on other datasets that are typically used in deep learning?
+The following two code fragments bring in access to the CIFAR-10 and SVHN data sets.
+You'll have to access the shape attribute of the training images to figure out how many visible units there are, and then re-allocate W before repeating the training and testing processes.
from skdata import cifar10
+n_examples = 10000
+# -- RECOMMENDED: restart the IPython kernel to clear out memory
+data_view = cifar10.view.OfficialVectorClassification(x_dtype='float32', n_train=n_examples)
+
+# Write the rest of the answer here:
+# - define x and y for training
+
+# Repeat the training steps in the code fragments above to train a linear SVM for CIFAR10
+from skdata import svhn
+n_examples = 10000
+# -- RECOMMENDED: restart the IPython kernel to clear out memory
+data_view = svhn.view.CroppedDigitsView2(x_dtype='float32', n_train=n_examples)
+
+# Write the rest of the answer here:
+# - define x and y for training
+
+# Repeat the training steps in the code fragments above to train a linear SVM for
+# Street View House Numbers (SVHN)
+Bonus: Pixel colors in cifar10 and svhn are encoded as RGB triples. There are many other ways of representing color. Some of these are linearly related to RGB, others non-linearly. +The scikit-image project defines a number of color-conversion routines. +Are any of these color representations better than RGB for image classification?
+from skimage.color import convert_colorspace
+
+# Write your answer here
+# use convert_colorspace to define new versions of the training and testing images.
+
+# Hint - you'll have to un-rasterize the x matrix into a N-dimensional array
+# n. examples x n. rows x n. cols x n. channels
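+One possible starting point is sketched below. It assumes that x holds the flattened CIFAR-10 training vectors from the fragment above and that they un-rasterize to a 32x32x3, channels-last layout with values in [0, 1] -- both assumptions are worth checking against skdata before relying on the result.
+from skimage.color import convert_colorspace
+
+x_imgs = x.reshape((n_examples, 32, 32, 3))
+x_hsv = np.array([convert_colorspace(im, 'RGB', 'HSV') for im in x_imgs])
+x_hsv_flat = x_hsv.reshape((n_examples, -1)).astype(x.dtype)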
+You don't have to use the hinge loss to train a linear classifier. Other popular loss functions include the squared hinge loss, the "Huberized" hinge loss, and the exponential function. These functions are continuously differentiable, which you might expect to help for gradient-based optimization. Does it help? How else are these functions different? In what cases would it matter?
+# Write your answer here
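+For example, a squared-hinge variant of the one-vs-all cost might look like the sketch below. It is written from scratch (rather than re-using util.hinge) and assumes the same +1/-1 one-hot convention for y1 used throughout this notebook.
+def ova_svm_cost_sqhinge(W, b, x, y1):
+    margin = y1 * (np.dot(x, W) + b)          # -- per-class margins
+    losses = np.maximum(0, 1 - margin) ** 2   # -- squared hinge, elementwise
+    return losses.mean()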
+TODO: VERIFY THIS QUESTION
+The One vs. All SVM suffers, in theory, from a lack of Bayes consistency. +The Crammer-Singer loss defines the multiclass training objective differently, in a way that is Bayes-consistent.
+To define the Crammer-Singer loss, let's define $\mathbf{h} = \mathbf{x}W + \mathbf{b}$ and look at the margin vector $\mathbf{m}$, where $\mathbf{m}_j = 1 - \mathbf{h}_l + \mathbf{h}_j$ for all elements $j$ that are not the correct label $l$.
+The Crammer-Singer objective is to minimize the expectation over the dataset of $\sum_{j \neq l} \max(0, \mathbf{m}_{j})$.
+Can you code this up as a training criterion and use it?
+HINT: Start with the case of pure stochastic gradient descent; I think the batch case requires NumPy's advanced indexing, which is not currently supported by autodiff.
# Write your answer here
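+As a starting point, here is a sketch of the per-example (batch of one) version. It uses the +1/-1 one-hot row y1 to pick out the correct-label score without advanced indexing; it may still need adjustment to run under autodiff.
+def crammer_singer_cost(W, b, x, y1):
+    h = np.dot(x, W) + b                 # -- class scores
+    h_correct = (h * (y1 > 0)).sum()     # -- score of the correct label l
+    margins = 1.0 - h_correct + h        # -- 1 - h_l + h_j for every class j
+    # -- only the incorrect classes (where y1 < 0) contribute to the loss
+    return (np.maximum(0, margins) * (y1 < 0)).sum()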
+This tutorial illustrates how to fit an $\ell_2$-regularized logistic regression model (logistic regression with a ridge-style penalty on the weights) to some dummy data, using Theano.
+TODO: bring in the content from Deep Learning Tutorial
+# Initial imports
+import numpy as np
+import theano.tensor as T
+from theano import shared, function
+rng = np.random.RandomState(123)
+# Create a sample logistic regression problem.
+true_w = rng.randn(100)
+true_b = rng.randn()
+xdata = rng.randn(50, 100)
+ydata = (np.dot(xdata, true_w) + true_b) > 0.0
+# Step 1. Declare Theano variables
+x = T.dmatrix()
+y = T.dvector()
+w = shared(rng.randn(100))
+b = shared(np.zeros(()))
+print "Initial model"
+print w.get_value()
+print b.get_value()
+# Step 2. Construct Theano expression graph
+p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
+xent = -y * T.log(p_1) - (1 - y) * T.log(1 - p_1)
+prediction = p_1 > 0.5
+cost = xent.mean() + 0.01 * (w ** 2).sum()
+gw, gb = T.grad(cost, [w, b])
+# Step 3. Compile expressions to functions
+train = function(inputs=[x, y],
+ outputs=[prediction, xent],
+ updates={w:w - 0.1 * gw,
+ b:b - 0.1 * gb})
+# Step 4. Perform computation
+for loop in range(100):
+ pval, xval = train(xdata, ydata)
+ print xval.mean()
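+To check the fit, we can also compile a separate prediction function over the same graph. This is a small sketch; the name predict is introduced here and is not part of the tutorial code above.
+predict = function(inputs=[x], outputs=prediction)
+print "training accuracy:", np.mean(predict(xdata) == ydata)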
+In the previous tutorial on linear SVMs, we looked at how training a model +consists in defining a training objective, and then minimizing it with respect +to model parameters by either stochastic or batch gradient descent.
+In this tutorial we will augment our SVM with internal layers of hidden +units and turn our linear classifier into a multi-layer architecture. +An MLP with one internal layer is sometimes called a shallow architecture, +in contrast to an MLP with more internal layers which is sometimes called a deep architecture.
+An MLP can be viewed as an SVM, where the input $\mathbf{x}$ is first transformed using a +learnt non-linear vector-valued transformation $\Phi$. +The purpose of this transformation is to map the input data into a space where it becomes linearly separable. +The vector $\Phi(\mathbf{x})$ is referred to as a hidden layer.
+This tutorial will again tackle the problem of MNIST digit classification.
+import logging
+import sys
+import time
+
+import numpy as np
+from numpy import arange, dot, maximum, ones, tanh, zeros
+from numpy.random import uniform
+
+from skdata import mnist
+import autodiff
+
+from util import show_filters
+
+from util import hinge
+from util import ova_svm_prediction
+from util import ova_svm_cost
+# -- load and prepare the data set (even download if necessary)
+dtype = 'float32'
+n_examples = 10000
+n_classes = 10 # -- denoted L in the math expressions
+img_shape = (28, 28)
+
+data_view = mnist.views.OfficialVectorClassification(x_dtype=dtype)
+x = data_view.train.x[:n_examples]
+y = data_view.train.y[:n_examples]
+# -- prepare a "1-hot" version of the labels, denoted Y in the math
+y1 = -1 * ones((len(y), n_classes)).astype(dtype)
+y1[arange(len(y)), y] = 1
+Typically the function $\Phi$ is taken to be the function +$\mathbb{R}^{D_0} \rightarrow (-1, 1)^{D_1}$
+$$\Phi(\mathbf{x}; V, \mathbf{c}) = \tanh( \mathbf{x} V + \mathbf{c})$$
+in which $V \in \mathbb{R}^{D_0 \times D_1}$ is called a weight matrix, and $\mathbf{c} \in \mathbb{R}^{D_1}$ is called a bias vector. The integer $D_0$ is the number of elements in $\mathbf{x}$ (sometimes, the number of "input units"). The integer $D_1$ is the number of "hidden units" of the MLP. We abuse notation slightly here by using $\tanh(\mathbf{u})$ for a vector $\mathbf{u}$ to denote the vector of values $\tanh(\mathbf{u}_i)$. Sometimes other non-linear scalar functions are used instead of the tanh function -- whichever one is used is called the activation function of the layer $\Phi$.
+def tanh_layer(V, c, x):
+ return np.tanh(np.dot(x, V) + c)
+When combined with an SVM classifier (recall previous tutorial), the full classification model can be written
+$$\mathrm{MLP}(\mathbf{x}) = \mathrm{SVM}\left(\Phi(\mathbf{x}; V, \mathbf{c}); W, \mathbf{b}\right)$$
+This sort of MLP (or Artificial Neural Network - ANN) with a single hidden layer +is sometimes represented graphically as follows:
+
A single hidden layer of this form is sufficient to make the MLP a universal approximator. +However we will see shortly +that there are benefits to using many such hidden layers (deep learning). +See these course notes for an +introduction to MLPs, the back-propagation algorithm, and how to train MLPs.
+# define prediction function
+def mlp_prediction(V, c, W, b, x):
+ h = tanh_layer(V, c, x)
+ return ova_svm_prediction(W, b, h)
+
+def mlp_cost(V, c, W, b, x, y1):
+ h = tanh_layer(V, c, x)
+ return ova_svm_cost(W, b, h, y1)
+To train an MLP, we learn all parameters of the model ($W, \mathbf{b}, V, +\mathbf{c}$) by gradient descent, +just as we learned the parameters $W, \mathbf{b}$ previously when training the +SVM.
+The initial values for the weights of a hidden layer ($V$) should be uniformly sampled from a symmetric interval that depends on the activation function. For the tanh activation function, results obtained in [Xavier10]_ show that the interval should be $$\left[ -\sqrt{\frac{6}{D_0 + D_1}}, \sqrt{\frac{6}{D_0 + D_1}} \right]$$
+For the logistic sigmoid function $1 / (1 + e^{-u})$ the interval is slightly +different:
+$$\left[ -4\sqrt{\frac{6}{D_0 + D_1}}, 4\sqrt{\frac{6}{D_0 + D_1}} \right]$$
+This initialization ensures that at least early in training, each neuron operates in a regime of its activation function where information can easily be propagated both upward (activations flowing from inputs to outputs) and backward (gradients flowing from outputs to inputs).
+# initialize input layer parameters
+n_inputs = 784 # -- aka D_0
+n_hidden = 100 # -- aka D_1
+
+V = uniform(low=-np.sqrt(6.0 / (n_inputs + n_hidden)),
+ high=np.sqrt(6.0 / (n_inputs + n_hidden)),
+ size=(n_inputs, n_hidden)).astype(x.dtype)
+c = zeros(n_hidden, dtype=x.dtype)
+
+# now allocate the SVM at the top
+W = zeros((n_hidden, n_classes), dtype=x.dtype)
+b = zeros(n_classes, dtype=x.dtype)
+As before, we train this model using stochastic gradient descent with mini-batches. The only difference is that the cost function now depends on all four parameters ($V$, $\mathbf{c}$, $W$, $\mathbf{b}$), so the gradient updates adjust the hidden layer as well as the classifier.
# SGD minimization
+
+# -- do n_online_loops passes through the data set doing SGD
+# This can be faster at the beginning than L-BFGS
+t0 = time.time()
+online_batch_size = 1
+n_batches = n_examples / online_batch_size
+V, c, W, b = autodiff.fmin_sgd(mlp_cost, (V, c, W, b),
+ streams={
+ 'x': x.reshape((n_batches, online_batch_size, x.shape[1])),
+ 'y1': y1.reshape((n_batches, online_batch_size, y1.shape[1]))},
+ loops=5,
+ stepsize=0.01,
+ print_interval=1000,
+ )
+print 'SGD took %.2f seconds' % (time.time() - t0)
+show_filters(V.T, img_shape, (10, 10))
+# L-BFGS minimization
+
+def batch_criterion(V, c, W, b):
+ return mlp_cost(V, c, W, b, x, y1)
+
+V, c, W, b = autodiff.fmin_l_bfgs_b(batch_criterion, (V, c, W, b), maxfun=20, m=20, iprint=1)
+
+print 'final_cost', batch_criterion(V, c, W, b)
+# -- N. B. the output from this command comes from Fortran, so IPython does not see it.
+# To monitor progress, look at the terminal from which you launched ipython
+show_filters(V.T, img_shape, (10, 10))
+Once the classifier has been trained, we can measure its generalization accuracy on the test set by making predictions for the test examples and counting the classification mistakes.
+train_predictions = mlp_prediction(V, c, W, b, x)
+train_errors = y != train_predictions
+print 'Current train set error rate', np.mean(train_errors)
+
+test_predictions = mlp_prediction(V, c, W, b, data_view.test.x[:])
+test_errors = data_view.test.y[:] != test_predictions
+print 'Current test set error rate', np.mean(test_errors)
+Try the following exercises out to get a better feel for training MLPs.
+It might seem more natural to initialize all of the MLP parameters to $0$. +What goes wrong when you do this?
+Try running L-BFGS starting from the original random initialization -- how good does it get on train and test, and how long does it take to find a solution?
+Now repeat that with various amounts of SGD before the L-BFGS. What happens?
+Now raise the number of examples seen per iteration of SGD. How does this affect the convergence of SGD and its value as a pre-conditioner?
+Regularization
+There are several hyper-parameters in the above code, which are not (and, +generally speaking, cannot be) optimized by gradient descent. +The design of outer-loop algorithms for optimizing them is a topic of ongoing +research. +Over the last 25 years, researchers have devised various rules of thumb for choosing them. +A very good overview of these tricks can be found in Efficient +BackProp by Yann LeCun, +Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here, we summarize +the same issues, with an emphasis on the parameters and techniques that we +actually used in our code.
+Which non-linear activation function should you use in a neural network? +Two of the most common ones are the logistic sigmoid and the tanh functions. +For reasons explained in Section 4.4, nonlinearities that +are symmetric around the origin are preferred because they tend to produce +zero-mean inputs to the next layer (which is a desirable property). +Empirically, we have observed that the tanh has better convergence properties.
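+To see the symmetry argument visually, here is a quick plotting sketch (it assumes the pylab-inline environment used elsewhere in these notebooks):
+u = np.arange(-5, 5, 0.1)
+title('tanh vs. logistic sigmoid')
+plot(u, np.tanh(u))              # -- symmetric around the origin, roughly zero-mean outputs
+plot(u, 1 / (1 + np.exp(-u)))    # -- outputs in (0, 1), mean well above zero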
+At initialization we want the weights to be small enough around the origin +so that the activation function operates near its linear regime, where gradients are +the largest. Otherwise, the gradient signal used for learning is attenuated by +each layer as it is propagated from the classifier towards the inputs. +Other desirable properties, especially for deep networks, +are to conserve variance of the activation as well as variance of back-propagated gradients from layer to layer. +This allows information to flow well upward and downward in the network and +reduces discrepancies between layers. +The initialization used above represents a good compromise between these two +constraints. +For mathematical considerations, please refer to [Xavier10]_.
+Optimization by stochastic gradient descent is very sensitive to the step size or learning rate. +There is a great deal of literature on how to choose a the learning rate, and how to change it during optimization. +The simplest solution is to use a constant rate. Rule of thumb: try +several log-spaced values ($10^{-1}, 10^{-2}, \ldots$) and narrow the +(logarithmic) grid search to the region where you obtain the lowest +validation error.
+Decreasing the learning rate over time can help a model to settle down into a +[local] minimum. +One simple rule for doing that is $\frac{\mu_0}{1 + d\times t}$ where +$\mu_0$ is the initial rate (chosen, perhaps, using the grid search +technique explained above), $d$ is a so-called "decrease constant" +which controls the rate at which the learning rate decreases (typically, a +smaller positive number, $10^{-3}$ and smaller) and $t$ is the epoch/stage.
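+As a toy illustration of that schedule (the values of $\mu_0$ and $d$ below are hypothetical; in practice they come from the grid search described above):
+mu_0 = 0.1   # -- initial learning rate (hypothetical)
+d = 1e-3     # -- decrease constant (hypothetical)
+for t in range(0, 5001, 1000):
+    print t, mu_0 / (1.0 + d * t)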
+Section 4.7 details +procedures for choosing a learning rate for each parameter (weight) in our +network and for choosing them adaptively based on the error of the classifier.
+The number of hidden units that gives best results is dataset-dependent. Generally speaking, the more complicated the input distribution is, the more capacity the network will require to model it, and so the larger the number of hidden units that will be needed. (Note that the number of weights in a layer, perhaps a more direct measure of capacity, is $D_0 \times D_1$, where $D_0$ is the number of inputs and $D_1$ is the number of hidden units.)
+Unless we employ some regularization scheme (early stopping or L1/L2 +penalties), a typical number of hidden units vs. generalization performance graph will be U-shaped.
+Typical values to try for the L1/L2 regularization parameter $\lambda$ are $10^{-2}, 10^{-3}, \ldots$. +It can be useful to regularize the topmost layers in an MLP (closest +to and including the classifier itself) to prevent them from overfitting noisy +hidden layer features, and to encourage the features themselves to be more +discriminative.
+This "notebook" is essentially a static page describing the notational conventions used throughout the tutorials. At some point it might migrate to a normal webpage, but for now it is here, being served up by IPython itself.
+We label data sets as $\mathcal{D}$. When the distinction is important, we indicate train, validation, and test sets as $\mathcal{D}_{train}$, $\mathcal{D}_{valid}$, and $\mathcal{D}_{test}$.
+We only ever train on training data, but we measure performance on validation data as a way of choosing the best hyperparameters (model selection). +Testing data is only used for estimating generalization after model selection and training.
+The tutorials mostly deal with classification problems, where each data set $\mathcal{D}$ is an indexed set of pairs $(x^{(i)},y^{(i)})$. We use superscripts to distinguish training set examples: $x^{(i)} \in \mathbb{R}^D$ is the i-th training example of dimensionality $D$. Similarly, $y^{(i)} \in \{0, \ldots, L-1\}$ is the i-th label assigned to input $x^{(i)}$. It is straightforward to extend these examples to ones where $y^{(i)}$ has other types (e.g. Gaussian for regression, or groups of multinomials for predicting multiple symbols).
+Expressions that appear in fixed-width font like a = b + c are meant to be
+ interpreted as Python code, whereas expressions appearing in italics like $a
+ = b + c$ are meant to be interpreted as math notation.
In math notation, non-bold lower-case symbols denote scalars.
+In math notation, bold lower-case symbols like $\mathbf{x}$ and + $\mathbf{h}$ denote vectors.
+In math notation, upper-case symbols like $W$ and $V$ typically denote + matrices.
+In the LaTeX-powered "math" expressions the notation $[a, b]$ denotes
+ the continuous inclusive range of real-valued numbers $u$ satisfying
+ $a \leq u \leq b$. In contrast, the Python syntax [a, b] denotes a
+ list data structure of two elements.
Python and NumPy tend to favor C-style "row-major" matrices, and we + will follow that convention in the math sections too. + Consequently, vector-matrix products will typically be written with + the vector on the left and the matrix on the right, as in + $\mathbf{h}V$.
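+As a small illustration of this convention, a plain NumPy sketch (the array values here are arbitrary):
+import numpy as np
+h = np.ones(3)                    # -- a "row" vector of activations
+V = np.arange(6).reshape(3, 2)    # -- a weight matrix with 3 inputs and 2 outputs
+print np.dot(h, V)                # -- vector on the left, matrix on the right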
+Unsupervised learning algorithms have often tackled image modeling by first modeling small image patches. This tutorial walks through the process of learning features from raw image patches, and then introduces a whitening algorithm that is often used to pre-process image patches before learning. Whitening is biologically plausible, gives rise to first-layer features resembling those measured physiologically (e.g. Gabor-like edge detectors), and typically yields better supervised performance.
+The whitening algorithm introduced here is borrowed from Adam Coates' +matlab sample code accompanying his paper +The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization with Andrew Ng at ICML 2011.
+Whitening is often not discussed in detail in papers, but it can be crucial to reproducing state-of-the-art performance. The Stanford UFLDL Tutorial +has a nice section on whitening.
+Building on the previous notebook on autoencoders, let's see how a standard autoencoder works out of the box on image patches.
+# -- SCRIPT INITIALIZATION
+
+import numpy as np
+import autodiff
+from util import show_filters
+from util import random_patches
+from util import mean_and_std
+from skdata import cifar10
+
+n_examples = 10000
+n_patches = 10000
+patch_H = 8
+patch_W = 8
+dtype='float32'
+# -- TIP: restart the IPython kernel to clear out memory
+data_view = cifar10.view.OfficialImageClassification(x_dtype=dtype, n_train=n_examples)
+x = data_view.train.x[:n_examples]
+x_patches = random_patches(x.transpose(0, 3, 1, 2), n_patches, patch_H, patch_W, np.random)
+x_patches_flat = x_patches.reshape(n_patches, -1)
+
+# -- define a show_filters function for our rgb data
+def show_flat_rgb(X, grid_shp):
+ K = patch_W * patch_H
+ # this is a bit of funny business to show RGB
+ show_filters((X[:, :K], X[:, K:2 * K], X[:, 2 * K:], None), (patch_H, patch_W), grid_shp)
+
+show_flat_rgb(x_patches_flat, (10, 10))
+# -- DEFINE THE AUTO-ENCODER
+
+def logistic(u):
+ """Return logistic sigmoid of float or ndarray `u`"""
+    return 1.0 / (1.0 + np.exp(-u))
+
+def squared_distance(x, z):
+ return ((x - z) ** 2).sum(axis=1)
+
+
+def training_criterion(W, b, c, x):
+    h = logistic(np.dot(x, W) + b)
+    z = logistic(np.dot(h, W.T) + c)
+ return squared_distance(x, z).mean()
+
+# -- tip: choose the number of hidden units as a pair, so that show_filters works
+n_visible = x_patches_flat.shape[1]
+n_hidden2 = (20, 20)
+n_hidden = np.prod(n_hidden2)
+W = np.random.uniform(
+ low=-4 * np.sqrt(6. / (n_hidden + n_visible)),
+ high=4 * np.sqrt(6. / (n_hidden + n_visible)),
+ size=(n_visible, n_hidden)).astype(dtype)
+b = np.zeros(n_hidden).astype(dtype)
+c = np.zeros(n_visible).astype(dtype)
+# -- TRAIN THE AUTOENCODER ON RAW PATCHES
+
+streams={'x': x_patches.reshape((n_examples, 1, n_visible))}
+W, b, c = autodiff.fmin_sgd(training_criterion,
+ args=(W, b, c),
+ streams=streams,
+ stepsize=0.01,
+ loops=5, # -- fmin_sgd can iterate over the streams repeatedly
+ print_interval=1000,
+ )
+show_flat_rgb(W.T, n_hidden2)
+As you will probably see, despite the reconstruction error dropping steadily, the filters of our auto-encoder do not resemble anything like Gabor edge detectors. To see Gabors, we will need to whiten the image patches.
+There are several techniques for whitening; this one combines a technique used in recent papers from Andrew Ng's group at Stanford with techniques developed by Nicolas Pinto at MIT.
+# -- fun hyper-parameters to optimize. By fun, I admittedly mean annoying.
+remove_mean = True
+hard_beta = True
+beta = 10.0
+gamma = 0.01
+
+def contrast_normalize(patches):
+ X = patches
+ if X.ndim != 2:
+ raise TypeError('contrast_normalize requires flat patches')
+ if remove_mean:
+ xm = X.mean(1)
+ else:
+ xm = X[:,0] * 0
+ Xc = X - xm[:, None]
+ l2 = (Xc * Xc).sum(axis=1)
+ figure()
+ title('l2 of image patches')
+ hist(l2)
+ if hard_beta:
+ div2 = np.maximum(l2, beta)
+ else:
+ div2 = l2 + beta
+ X = Xc / np.sqrt(div2[:, None])
+ return X
+
+def ZCA_whiten(patches):
+ # -- ZCA whitening (with band-pass)
+
+ # Algorithm from Coates' sc_vq_demo.m
+
+ X = patches.reshape(len(patches), -1).astype('float64')
+
+ X = contrast_normalize(X)
+ print 'patch_whitening_filterbank_X starting ZCA'
+ M, _std = mean_and_std(X)
+ Xm = X - M
+ assert Xm.shape == X.shape
+ print 'patch_whitening_filterbank_X starting ZCA: dot', Xm.shape
+ C = np.dot(Xm.T, Xm) / (Xm.shape[0] - 1)
+ print 'patch_whitening_filterbank_X starting ZCA: eigh'
+ D, V = np.linalg.eigh(C)
+ figure()
+ plot(D)
+ title('Eigenspectrum of image patches')
+ print 'patch_whitening_filterbank_X starting ZCA: dot', V.shape
+ P = np.dot(np.sqrt(1.0 / (D + gamma)) * V, V.T)
+ assert M.ndim == 1
+ return M, P, X
+
+ZCA_M, ZCA_P, ZCA_X = ZCA_whiten(x_patches_flat)
+x_white_flat = np.dot(ZCA_X - ZCA_M, ZCA_P)
+figure()
+show_flat_rgb(x_white_flat[:100], (10, 10))
+title('ZCA-whitened image patches')
+show()
+# -- TRAIN THE AUTOENCODER ON WHITENED PATCHES
+
+# Remember to re-initialize the model parameters if you were training them
+# on the raw patches earlier
+
+whitened_streams={'x': x_white_flat.reshape((n_examples, 1, n_visible))}
+W, b, c = autodiff.fmin_sgd(training_criterion,
+ args=(W, b, c),
+ streams=whitened_streams,
+ stepsize=0.01,
+ loops=5, # -- fmin_sgd can iterate over the streams repeatedly
+ print_interval=1000,
+ )
+show_flat_rgb(W.T, n_hidden2)
+Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Theano features tight integration with NumPy, transparent use of a GPU, efficient symbolic differentiation, speed and stability optimizations, and dynamic C code generation.
+Theano has been powering large-scale computationally intensive scientific investigations since 2007.
+Theano is now available on
+PyPI, and can be installed via
+$ easy_install Theano
+ -- or --
+$ pip install Theano
or by downloading and unpacking the tarball and typing python setup.py install.
Those interested in bleeding-edge features should obtain the latest development version, available via:
+git clone git://github.com/Theano/Theano.git
+You can then place the checkout directory on your $PYTHONPATH or use python setup.py develop to install a .pth into your site-packages directory, so that when you pull updates via Git, they will be automatically reflected in the "installed" version. For more information about installation and configuration, see
+installing Theano.
When Theano is installed, you will be able to import it in Python like this:
+import theano
+from theano import tensor as TT
+Theano has been developed by the LISA Machine Learning group at the University of Montreal.
+Many lab members and members of the Python community at large have made significant and helpful contributions.
+Here are some quick guides to NumPy:
+Rows are horizontal and columns are vertical. Every row is an example: a matrix inputs of shape [10, 5] therefore holds 10 examples, each of dimension 5. If this were the input of a neural network, then the weights from the input to the first hidden layer would form a matrix of size (5, #hid).
+Consider this array:
+a = np.asarray([[1., 2], [3, 4], [5, 6]])
+print a
+print a.shape
+This is a 3x2 matrix, i.e. there are 3 rows and 2 columns.
+To access the entry in the 3rd row (row #2) and the 1st column (column #0):
+print a[2, 0]
+To remember this, keep in mind that we read left-to-right, top-to-bottom, +so each thing that is contiguous is a row. That is, there are 3 rows +and 2 columns.
+Numpy does broadcasting of arrays of different shapes during arithmetic operations. What this means in general is that the smaller array (or scalar) is broadcasted across the larger array so that they have compatible shapes. The example below shows an instance of broadcasting:
+a = np.asarray([1.0, 2.0, 3.0])
+b = 2.0
+print a * b
+The smaller array b (actually a scalar here, which works like a 0-d array) in this case is broadcasted to the same size
+as a during the multiplication. This trick is often useful in
+simplifying how expressions are written. More detail about broadcasting
+can be found in the numpy user guide.
In this first Theano tutorial we'll see how to do a simple calculation involving scalars and matrices.
+To get us started with Theano and get a feel of what we're working with, +let's make a simple function: add two numbers together. Here is how you do +it:
+import theano.tensor as T
+from theano import function
+x = T.dscalar('x')
+y = T.dscalar('y')
+z = x + y
+f = function([x, y], z)
+And now that we've created our function we can use it:
+print f(2, 3)
+print f(16.3, 12.1)
+Let's break this down into several steps. The first step is to define
+two symbols (Variables) representing the quantities that you want
+to add. Note that from now on, we will use the term
+Variable to mean "symbol" (in other words,
+x, y, z are all Variable objects). The output of the function
+f is a numpy.ndarray with zero dimensions.
If you are following along and typing into an interpreter, you may have
+noticed that there was a slight delay in executing the function
+instruction. Behind the scenes, f was being compiled into C code.
A Variable is the main data structure you work with when
+using Theano. The symbolic inputs that you operate on are
+Variables and what you get from applying various operations to
+these inputs are also Variables. For example, when I type
+"x = theano.tensor.ivector(); y = -x" then x and y are both Variables, i.e. instances of the
+theano.gof.graph.Variable class. The
+type of both x and y is theano.tensor.ivector.
x = T.dscalar('x')
+y = T.dscalar('y')
+In Theano, all symbols must be typed. In particular, T.dscalar
+is the type we assign to "0-dimensional arrays (scalar) of doubles
+(d)". It is a Theano :ref:type.
dscalar is not a class. Therefore, neither x nor y
+are actually instances of dscalar. They are instances of
+TensorVariable. x and y
+are, however, assigned the theano Type dscalar in their type
+field, as you can see here:
print type(x)
+print x.type
+print T.dscalar
+print x.type is T.dscalar
+By calling T.dscalar with a string argument, you create a
+Variable representing a floating-point scalar quantity with the
+given name. If you provide no argument, the symbol will be unnamed. Names
+are not required, but they can help debugging.
More will be said in a moment regarding Theano's inner structure. You
+could also learn more by looking into the graph structures section of the Theano documentation.
The second step is to combine x and y into their sum z:
+z = x + y
+z is yet another Variable which represents the addition of
+x and y. You can use the pp
+function to pretty-print out the computation associated to z.
from theano.printing import pp
+print pp(z) # -- prints (x + y)
+The last step is to create a function taking x and y as inputs +and giving z as output:
+f = function([x, y], z)
+The first argument to function() is a list of Variables
+that will be provided as inputs to the function. The second argument
+is a single Variable or a list of Variables. For either case, the second
+argument is what we want to see as output when we apply the function. f may
+then be used like a normal Python function.
evalAs a shortcut, you can skip step 3, and just use a variable's
+eval() method. The eval() method is not as flexible
+as function() but it can do everything we've covered in
+the tutorial so far. It has the added benefit of not requiring
+you to import function(). Here is how eval() works:
import theano.tensor as T
+x = T.dscalar('x')
+y = T.dscalar('y')
+z = x + y
+z.eval({x : 16.3, y : 12.1})
+array(28.4)
+
+We passed eval a dictionary mapping symbolic theano
+variables to the values to substitute for them, and it returned
+the numerical value of the expression. eval() will be slower the first time you call it on a variable --
+it needs to call function() to compile the expression behind
+the scenes. Subsequent calls to eval() on that same variable
+will be fast, because the variable caches the compiled function.
You might already have guessed how to do this. Indeed, the only change +from the previous example is that you need to instantiate x and +y using the matrix Types:
+x = T.dmatrix('x')
+y = T.dmatrix('y')
+z = x + y
+f = function([x, y], z)
+dmatrix is the Type for matrices of doubles. Then we can use
+our new function on 2D arrays:
f([[1, 2], [3, 4]], [[10, 20], [30, 40]])
+The result is a NumPy array. We can also use NumPy arrays directly as inputs:
+import numpy
+f(numpy.array([[1, 2], [3, 4]]), numpy.array([[10, 20], [30, 40]]))
+It is possible to add scalars to matrices, vectors to matrices, scalars to vectors, etc. The behavior of these operations is defined by Theano's broadcasting rules; see the library documentation on broadcasting in NumPy and Theano.
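+For instance, a vector can be added to every row of a matrix. A small sketch, re-using the x defined above:
+v = T.dvector('v')
+add_row = function([x, v], x + v)
+print add_row([[1, 2], [3, 4]], [10, 100])  # -- the vector is broadcast across the rows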
+The following types are available:
+byte: bscalar, bvector, bmatrix, brow, bcol, btensor3, btensor4
+16-bit integers: wscalar, wvector, wmatrix, wrow, wcol, wtensor3, wtensor4
+32-bit integers: iscalar, ivector, imatrix, irow, icol, itensor3, itensor4
+64-bit integers: lscalar, lvector, lmatrix, lrow, lcol, ltensor3, ltensor4
+float: fscalar, fvector, fmatrix, frow, fcol, ftensor3, ftensor4
+double: dscalar, dvector, dmatrix, drow, dcol, dtensor3, dtensor4
+complex: cscalar, cvector, cmatrix, crow, ccol, ctensor3, ctensor4
+The previous list is not exhaustive, and a guide to all types compatible with NumPy arrays may be found in the library documentation on tensor creation.
+Note that you, the user---not the system architecture---have to choose whether your
+ program will use 32- or 64-bit integers (i prefix vs. the l prefix)
+ and floats (f prefix vs. the d prefix).
Here's another straightforward example, though a bit more elaborate +than adding two numbers together. Let's say that you want to compute +the logistic curve, which is given by:
+$s(x) = \frac{1}{1 + e^{-x}}$
+It's called a sigmoid curve because it looks "S"-shaped. If you plot it, you'll see why:
+x = np.arange(-5, 5, .1)
+title('logistic sigmoid')
+line = plot(x, 1 / (1 + np.exp(-x)))
+NumPy and Theano both evaluate scalar functions like this element-wise when operating on tensor quantities such as vectors and matrices. So if you want to compute the logistic sigmoid of each number in x, then you do it by applying all of the required operations -- negation, exponentiation, addition, and division -- on entire tensors at once.
+x = T.dmatrix('x')
+s = 1 / (1 + T.exp(-x))
+logistic = function([x], s)
+print logistic([[0, 1], [-1, -2]])
+At this point it would be wise to begin familiarizing yourself +more systematically with Theano's fundamental objects and operations by browsing +this section of the library: +Theano basic tensor functionality.
+As the tutorial unfolds, you should also gradually acquaint yourself with the other +relevant areas of the library and with the relevant subjects of the documentation +entrance page.
+Modify and execute the following code to compute the expression: $a^2 + b^2 + 2 a b$.
+import theano
+a = theano.tensor.vector() # declare variable
+out = a + a ** 10 # build symbolic expression
+f = theano.function([a], out) # compile function
+print f([0, 1, 2]) # prints `array([0, 2, 1026])`
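+One possible solution sketch, declaring a second vector b and rebuilding the expression:
+import theano
+a = theano.tensor.vector()         # declare variables
+b = theano.tensor.vector()
+out = a ** 2 + b ** 2 + 2 * a * b  # build symbolic expression
+f = theano.function([a, b], out)   # compile function
+print f([0, 1, 2], [1, 2, 3])      # (0+1)^2, (1+2)^2, (2+3)^2 -> [1., 9., 25.]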
+The previous tutorials (Part 1, Part 2) have already used theano.function to create callable objects from Theano graphs. This tutorial will walk through the various capabilities of function and the important concept of shared variables.
The library documentation of function is a good companion to this tutorial.
# -- imports we need for this tutorial
+from theano import tensor as T
+from theano import function, shared
+Theano supports functions with multiple outputs. For example, we can +compute the elementwise difference, absolute difference, and +squared difference between two matrices a and b at the same time.
+When we use the function f, it returns the three computed results as a list.
a, b = T.dmatrices('a', 'b')
+diff = a - b
+abs_diff = abs(diff)
+diff_squared = diff ** 2
+
+f = function([a, b], [diff, abs_diff, diff_squared])
+
+print f([[1, 1], [1, 1]], [[0, 1], [2, 3]])
+(Note that dmatrices produces as many outputs as names that you provide.
+It is a shortcut for allocating symbolic variables that we will often use in the tutorials,
+but it is not so typically used outside tutorial examples.)
Let's say you want to define a function that adds two numbers, except
+that if you only provide one number, the other input is assumed to be
+one. In Python, the default value for parameters achieves this effect.
+In Theano you can achieve this effect with a Param object.
This makes use of the Param class which allows
+you to specify properties of your function's parameters with greater detail. Here we
+give a default value of 1 for y by creating a Param instance with
+its default field set to 1. Inputs with default values must follow inputs without default
+values (like Python's functions). There can be multiple inputs with default values. These parameters can
+be set positionally or by name, as in standard Python.
# -- Theano version of lambda x, y=1: x + y
+
+from theano import Param
+x, y = T.dscalars('x', 'y')
+z = x + y
+f = function([x, Param(y, default=1)], z)
+
+print f(33)
+print f(33, 2)
+print f(34)
+# -- Theano version of lambda x, y=1, w_by_name=2: (x + y) * w
+x, y, w = T.dscalars('x', 'y', 'w')
+z = (x + y) * w
+g = function([x, Param(y, default=1), Param(w, default=2, name='w_by_name')], z)
+print g(33)
+print g(33, 2)
+print g(33, 0, 1)
+print g(33, w_by_name=1)
+print g(33, w_by_name=1, y=0)
+N.B. Param does not know the name of the local variables y and w
+that are passed to it as arguments. If the symbolic variable objects have name
+attributes (set by dscalars in the example above) then these are the
+names of the keyword parameters in the functions that we build. This is
+the mechanism at work in Param(y, default=1). In the case of
+Param(w, default=2, name='w_by_name'), we override the symbolic variable's name
+attribute with a new name to be used as the parameter name for g.
It is also possible to make a function with an internal state. For example, let’s say we want to make an accumulator: at the beginning, the state is initialized to zero. Then, on each function call, the state is incremented by the function’s argument.
+First let’s define the accumulator function. It adds its argument to the internal state, and returns the old state value.
+from theano import shared
+state = shared(0)
+inc = T.iscalar('inc')
+accumulator = function([inc], state, updates=[(state, state+inc)])
+This code introduces a few new concepts. The shared function constructs so-called shared variables. These are hybrid symbolic and non-symbolic variables whose value may be shared between multiple functions. Shared variables can be used in symbolic expressions just like the objects returned by dmatrices(...) but they also have an internal value that defines the value taken by this symbolic variable in all the functions that use it. It is called a shared variable because its value is shared between many functions. The value can be accessed and modified by the .get_value() and .set_value() methods. We will come back to this soon.
+The other new thing in this code is the updates parameter of function. updates must be supplied with a list of pairs of the form (shared-variable, new expression). It can also be a dictionary whose keys are shared-variables and values are the new expressions. Either way, it means “whenever this function runs, it will replace the .value of each shared variable with the result of the corresponding expression”. Above, our accumulator replaces the state‘s value with the sum of the state and the increment amount.
+Let’s try it out!
+print state.get_value()
+print accumulator(1)
+print state.get_value()
+print accumulator(300)
+print state.get_value()
+It is possible to reset the state. Just use the .set_value() method:
+state.set_value(-1)
+print accumulator(3)
+print state.get_value()
+As we mentioned above, you can define more than one function to use the same shared variable. These functions can all update the value.
+decrementor = function([inc], state, updates=[(state, state-inc)])
+print decrementor(2)
+print state.get_value()
+You might be wondering why the updates mechanism exists. You can always achieve a similar result by returning the new expressions, and working with them in NumPy as usual. While the updates mechanism can be a syntactic convenience, it is mainly there for efficiency. Updates to shared variables can sometimes be done more quickly using in-place algorithms (e.g. low-rank matrix updates). Also, Theano has more control over where and how shared variables are allocated, which is one of the important elements of getting good performance on the GPU.
+givensIt may happen that you expressed some formula using a shared variable, but you do not want to use its value. In this case, you can use the givens parameter of function which replaces a particular node in a graph for the purpose of one particular function.
+fn_of_state = state * 2 + inc
+foo = T.lscalar() # the type (lscalar) must match the shared variable we
+ # are replacing with the ``givens`` list
+skip_shared = function([inc, foo], fn_of_state, givens=[(state, foo)])
+print skip_shared(1, 3) # we're using 3 for the state, not state.value
+print state.get_value() # old state still there, we didn't use it
+The givens parameter can be used to replace any symbolic variable, not just a shared variable. You can replace constants, and any other expression as long as the new expression has the same type as the one you are replacing.
+A good way of thinking about the givens is as a mechanism that allows you to replace any part of your formula with a different expression that evaluates to a tensor of same shape and dtype.
+WIP, TODO, etc.
+Because in Theano you first express everything symbolically and +afterwards compile this expression to get functions, +using pseudo-random numbers is not as straightforward as it is in +NumPy, though also not too complicated.
+The way to think about putting randomness into Theano's computations is
+to put random variables in your graph. Theano will allocate a NumPy
+RandomStream object (a random number generator) for each such
+variable, and draw from it as necessary. We will call this sort of
+sequence of random numbers a random stream. Random streams are at
+their core shared variables, so the observations on shared variables
+hold here as well. Theano's random objects are defined and implemented in
+RandomStreams and, at a lower level,
+in RandomStreamsBase.
Here's a brief example. The setup code is:
+from theano.tensor.shared_randomstreams import RandomStreams
+srng = RandomStreams(seed=234)
+rv_u = srng.uniform((2,2))
+rv_n = srng.normal((2,2))
+f = function([], rv_u)
+g = function([], rv_n, no_default_updates=True) #Not updating rv_n.rng
+nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
+
+Here, 'rv_u' represents a random stream of 2x2 matrices of draws from a uniform
+distribution. Likewise, 'rv_n' represents a random stream of 2x2 matrices of
+draws from a normal distribution. The distributions that are implemented are
+defined in RandomStreams and, at a lower level, in raw_random.
+Now let's use these objects. If we call f(), we get random uniform numbers. +The internal state of the random number generator is automatically updated, +so we get different random numbers every time.
+f_val0 = f()
+f_val1 = f()  # different numbers from f_val0
+
When we add the extra argument no_default_updates=True to
+function (as in g), then the random number generator state is
+not affected by calling the returned function. So, for example, calling
+g multiple times will return the same numbers.
+g_val0 = g()  # different numbers from f_val0 and f_val1
+g_val1 = g()  # same numbers as g_val0!
+
An important remark is that a random variable is drawn at most once during any +single function execution. So the nearly_zeros function is guaranteed to +return approximately 0 (except for rounding error) even though the rv_u +random variable appears three times in the output expression.
+nearly_zeros = function([], rv_u + rv_u - 2 * rv_u)
+
Random variables can be seeded individually or collectively.
+You can seed just one random variable by seeding or assigning to the
+.rng attribute, using .rng.set_value().
+rng_val = rv_u.rng.get_value(borrow=True)  # Get the rng for rv_u
+rng_val.seed(89234)  # seeds the generator
+rv_u.rng.set_value(rng_val, borrow=True)  # Assign back seeded rng
+
+You can also seed all of the random variables allocated by a RandomStreams
+object by that object's seed method. This seed will be used to seed a
+temporary random number generator, that will in turn generate seeds for each
+of the random variables.
+srng.seed(902340)  # seeds rv_u and rv_n with different seeds each
+Sharing Streams Between Functions
+
As usual for shared variables, the random number generators used for random +variables are common between functions. So our nearly_zeros function will +update the state of the generators used in function f above.
+For example:
+state_after_v0 = rv_u.rng.get_value().get_state()
+nearly_zeros()  # this affects rv_u's generator
+v1 = f()
+rng = rv_u.rng.get_value(borrow=True)
+rng.set_state(state_after_v0)
+rv_u.rng.set_value(rng, borrow=True)
+v2 = f()  # v2 != v1
+
There are other distributions implemented as well.
TODO: OTHER RandomStreams base implementations (CUDA, OpenCL, MRG, etc.)
+