|
|
|
Simon Dennis |
|
School of Psychology |
|
University of Queensland |
|
Materials by Devin McAuley available at
http://psy.uq.edu.au/~brainwav/Manual. |
|
|
|
|
|
|
In many real world situations, we are faced with
incomplete or noisy data, and it is important to be able to make reasonable
predictions about novel cases from the information available. |
|
Backpropagation networks adapt their weights to
acquire a training set of input/output pattern pairs. |
|
After a
Backpropagation network has learned, it can be applied to a test set of
input patterns to see how well it generalizes to untrained patterns. |
|
|
|
|
|
|
Name
Age Educ Marrge Occptn
Gang |
|
Bill 40 Coll Single Pusher Jets |
|
Mike 20 H.S. Single Pusher Jets |
|
Joan 20 J.H. Single Pusher Jets |
|
Cath 20 Coll Marrie Pusher Jets |
|
John 20 Coll Divorc Pusher Jets |
|
Bert 20 Coll Single Burglr Jets |
|
Margart 30 J.H Marrie Bookie Shark |
|
Janet 20 J.H. Marrie Bookie Shark |
|
|
|
|
|
|
|
|
|
|
One of the earliest learning networks was
proposed by Rosenblatt in the late 1950's. |
|
The task of Rosenblatt's "perceptron"
was to discover a set of connection weights that correctly classified a set
of binary input vectors. |
|
|
|
|
If the input vector is correctly classified,
then the weights are left unchanged, and the next input vector is
presented. |
|
If the output unit is 1 when the correct
classification is 0, then the threshold is incremented by 1. |
|
If the input Ii is 0, then the
corresponding weight wi is left unchanged. If the input Ii
is 1, then the corresponding weight is decreased by 1. |
|
When the output is turned off when it should be
turned on, the opposite changes are made. |
|
|
|
|
Weight1
Weight2 Threshld Input1
Input2 Output Target |
|
|
|
-0.5 0.5 2.5 1 1 0 1 |
|
0.5 1.5 1.5 1 0 0 1 |
|
1.5 1.5 0.5 0 1 1 1 |
|
1.5 1.5 0.5 0 0 0 0 |
|
|
|
|
The perceptron learning rule is always able to
discover a set of weights that correctly classifies its inputs, given that
the set of weights exists. |
|
The Perceptron Convergence Theorem fueled much
of the early interest in neural networks. |
|
|
|
|
The limitations of perception were documented by
Minsky and Papert in their book Perceptrons (Minksy and Papert, 1969). |
|
The now classic example of a simple function
that can not be computed by a perceptron is the exclusive-or (XOR) problem. |
|
|
|
|
|
|
Problems which can be solved by separating the
two classes with a hyperplane are called linearly separable. |
|
The XOR problem is not a linearly separable
problem. |
|
|
|
|
Given a sufficient number of hidden units, it is
possible to recode any unsolvable problem into a linearly separable one. |
|
|
|
|
A hidden layer provides a pool of units from
which features (which help distinguish the inputs) can potentially develop. |
|
Since there are no target activations for the
hidden units, the perceptron learning rule does not extend to multilayer
networks. |
|
So, how can a network develop an internal
representation of a pattern? |
|
|
|
|
|
The Backpropagation algorithm solves the credit
assignment problem for multilayer networks. |
|
First proposed by Paul Werbos in the 1970's.
However, it wasn't until it was rediscovered in 1986 by Rumelhart and
McClelland that BackPropagation became widely used. |
|
Two parts: |
|
Sigmoid Activation Rule |
|
Gradient Descent using Error Backpropagation |
|
|
|
|
|
|
Net input: |
|
netj = Siwjiai |
|
Activation Rule: (sigmoid): |
|
F(netj) = 1/(1 + e-netj+
qj ). |
|
The sigmoid function is differentiable |
|
The bias term plays the same role as the
threshold in the perceptron. |
|
|
|
|
After the error is computed, each weight is
adjusted in proportion to the error gradient backpropagated from the
outputs to the inputs. The changes in the weights reduce the overall error
in the network. |
|
|
|
|
|
Weights move in the direction of steepest
descent on the error surface defined by the Error Function: |
|
E = 1/2 Sp Sj(tpj-opj) 2 |
|
Weights are changed in proportion to the
negative of an error derivative with respect to each weight: |
|
dwji
= -e [dE/dwji] |
|
Because the activation function is continuous dE/dwji can be
calculated. |
|
|
|
|
Error backpropagated through the network is
proportional to the value of the weights. |
|
If all the weights are the same, then the
backpropaged errors will be the same, and all of the weights will be
updated by the same amount. |
|
If the solution to the problem requires that the
network learn unequal weights, then having equal weights to start with will
prevent the network from learning. |
|
It is also helpful if the activations of the
units are chosen to be small values (see Neural Networks by example,
Chapter 5 for an explanation). |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The more hidden units you have in a network the
less likely you are to encounter a local minimum. |
|
|
|
|
The larger the learning rate e, the larger the weight
changes on each epoch, and the quicker the network learns. |
|
If the learning rate gets too large, then the
weight changes no longer approximate a gradient descent procedure.
Oscillation of the weights is often the result. |
|
We would like to use the largest learning rate
possible without triggering oscillation. |
|
|
|
|
|
Learning rule with momentum: |
|
dwji(n+1)
= edpjapi
+ adwji(n) |
|
Helps to roll past local minima and to
accelerate down shallow slopes |
|
In BrainWave |
|
default learning rate = 0.25 |
|
default momentum = 0.9 |
|
When applying backpropagation, you will often
want to use much smaller values than these. For especially difficult tasks,
a learning rate of 0.01 is not uncommon. |
|
|
|
|
McAuley, J. D. (1997). The Backpropation
Network: Learning by example. In Neural Networks by Example at http://psy.uq.edu.au/~brainwav/Manual. |
|
McClelland, J. L. & Rumelhart, D. E. (Eds.).
(1988). Explorations in parallel distributed processing: A handbook of
models, programs, and exercises. Cambridge, MA: MIT Press. |
|
Rumelhart, D. E., & McClelland, J. L.
(Eds.). (1986). Parallel distributed processing: Explorations in the
microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press. |
|