Notes
Outline
The BackPropagation Network: Learning by Example
Simon Dennis
School of Psychology
University of Queensland
Materials by Devin McAuley available at http://psy.uq.edu.au/~brainwav/Manual.
Introduction
In many real world situations, we are faced with incomplete or noisy data, and it is important to be able to make reasonable predictions about novel cases from the information available.
Backpropagation networks adapt their weights to acquire a training set of input/output pattern pairs.
 After a Backpropagation network has learned, it can be applied to a test set of input patterns to see how well it generalizes to untrained patterns.
Supervised Learning
Jets and Sharks Example Data
Name    Age  Educ Marrge Occptn Gang
Bill 40 Coll Single Pusher Jets
Mike 20 H.S. Single Pusher Jets
Joan 20 J.H. Single Pusher Jets
Cath 20 Coll Marrie Pusher Jets
John 20 Coll Divorc Pusher Jets
Bert 20 Coll Single Burglr Jets
Margart 30 J.H Marrie Bookie Shark
Janet 20 J.H. Marrie Bookie Shark
Jets and Sharks Network
Threshold Logic Units
Implementing the AND Function
Simple Learning Machines
One of the earliest learning networks was proposed by Rosenblatt in the late 1950's.
The task of Rosenblatt's "perceptron" was to discover a set of connection weights that correctly classified a set of binary input vectors.
Perceptron Learning Procedure
If the input vector is correctly classified, then the weights are left unchanged, and the next input vector is presented.
If the output unit is 1 when the correct classification is 0, then the threshold is incremented by 1.
If the input Ii is 0, then the corresponding weight wi is left unchanged. If the input Ii is 1, then the corresponding weight is decreased by 1.
When the output is turned off when it should be turned on, the opposite changes are made.
Perceptron Learning Example: Logical OR
Weight1  Weight2  Threshld  Input1  Input2  Output  Target
 -0.5      0.5       2.5       1       1       0       1
  0.5      1.5       1.5       1       0       0       1
  1.5      1.5       0.5       0       1       1       1
  1.5      1.5       0.5       0       0       0       0
Convergence of the PLP
The perceptron learning rule is always able to discover a set of weights that correctly classifies its inputs, given that the set of weights exists.
The Perceptron Convergence Theorem fueled much of the early interest in neural networks.
Why a Hidden Layer?
The limitations of perception were documented by Minsky and Papert in their book Perceptrons (Minksy and Papert, 1969).
The now classic example of a simple function that can not be computed by a perceptron is the exclusive-or (XOR) problem.
The XOR Problem
The Geometric Description
Problems which can be solved by separating the two classes with a hyperplane are called linearly separable.
The XOR problem is not a linearly separable problem.
Using a Hidden Layer to Solve the XOR problem
Given a sufficient number of hidden units, it is possible to recode any unsolvable problem into a linearly separable one.
The Credit Assignment Problem in Multilayer Networks
A hidden layer provides a pool of units from which features (which help distinguish the inputs) can potentially develop.
Since there are no target activations for the hidden units, the perceptron learning rule does not extend to multilayer networks.
So, how can a network develop an internal representation of a pattern?
The Backpropagation Algorithm
The Backpropagation algorithm solves the credit assignment problem for multilayer networks.
First proposed by Paul Werbos in the 1970's. However, it wasn't until it was rediscovered in 1986 by Rumelhart and McClelland that BackPropagation became widely used.
Two parts:
Sigmoid Activation Rule
Gradient Descent using Error Backpropagation
The Sigmoid Activation Rule
Net input:
netj = Siwjiai
Activation Rule: (sigmoid):
F(netj) = 1/(1 + e-netj+ qj ).
The sigmoid function is differentiable
The bias term plays the same role as the threshold in the perceptron.
Gradient Descent
After the error is computed, each weight is adjusted in proportion to the error gradient backpropagated from the outputs to the inputs. The changes in the weights reduce the overall error in the network.
The Error Surface
Weights move in the direction of steepest descent on the error surface defined by the Error Function:
E = 1/2 Sp Sj(tpj-opj) 2
Weights are changed in proportion to the negative of an error derivative with respect to each weight:
dwji = -e [dE/dwji]
Because the activation function is continuous dE/dwji can be calculated.
How to Select Initial Weights
Error backpropagated through the network is proportional to the value of the weights.
If all the weights are the same, then the backpropaged errors will be the same, and all of the weights will be updated by the same amount.
If the solution to the problem requires that the network learn unequal weights, then having equal weights to start with will prevent the network from learning.
It is also helpful if the activations of the units are chosen to be small values (see Neural Networks by example, Chapter 5 for an explanation).
Local Minima
The more hidden units you have in a network the less likely you are to encounter a local minimum.
Learning Rate
The larger the learning rate e, the larger the weight changes on each epoch, and the quicker the network learns.
If the learning rate gets too large, then the weight changes no longer approximate a gradient descent procedure. Oscillation of the weights is often the result.
We would like to use the largest learning rate possible without triggering oscillation.
Momentum
Learning rule with momentum:
dwji(n+1) = edpjapi + adwji(n)
Helps to roll past local minima and to accelerate down shallow slopes
In BrainWave
default learning rate = 0.25
default momentum = 0.9
When applying backpropagation, you will often want to use much smaller values than these. For especially difficult tasks, a learning rate of 0.01 is not uncommon.
References
McAuley, J. D. (1997). The Backpropation Network: Learning by example. In Neural Networks by Example at http://psy.uq.edu.au/~brainwav/Manual.
McClelland, J. L. & Rumelhart, D. E. (Eds.). (1988). Explorations in parallel distributed processing: A handbook of models, programs, and exercises. Cambridge, MA: MIT Press.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.