## Excerpt

## LIST OF FIGURES

1.1 Timeline for research of Neural Network

1.2 The Rashevsky XOR Network

1.3 McCulloch & Pitts Logical AND

1.4 The McCulloch & Pitts Neuron

1.5 Rosenblatt’s Perceptron

1.6 Adaptive linear element (ADALINE)

1.7 Separable AND and OR and non-separable XOR

1.8 XOR Solution

1.9 Various Activation Functions

1.10 Three Layer MLP Network

1.11 Competitive Learning

1.12 Radial Basis Function Network

1.13 Standard versus Normalized RBFs

1.14 The Evolutionary Cycle

1.15 Triangular and Trapezoidal (Shoulder) MFs

1.16 Fuzzy Logic System Structure

1.17 TSK Method

1.18 Adaptive Neuro Fuzzy Inference System (ANFIS)

2.1 The Biological Neuron

2.2 The Synapse

2.3 Linear and Poisson Encoding Schemes

2.4 Poisson Probability Density Function

2.5 Conductance-based LIF Neuron Dynamics

2.6 The STDP Learning Function

2.7 Function Approximation using SHL

2.8 Constrained Function Approximation using SHL

2.9 PSR for a Facilitating Synapse

2.10 PSR for a Depressing Synapse

2.11 Centre surround receptive field

3.1 The amount and type of synaptic modification (STDP) evoked by repeated pairing of pre- and postsynaptic action potentials in different preparations

3.2 (A) Feedforward spiking neural network

3.2 (B) Connection consisting of multiple delayed synaptic terminals

3.3 Summary of the neuro-computational properties of biological spiking neurons

4.1 Types of Neurons

4.2 Types of Synapses

4.3(A1) Input Spike Train

4.3(A2) Conversion of 0 and 1 to frequencies

4.3(B) RBF-SNN topology for XOR

4.3(C) RF Configuration for each hidden layer

4.4 Input Spikes

4.5 Supervision using a modified-SHL algorithm

5.1(A1) Input Spike Train

5.1(A2) Conversion of 0 and 1 to frequencies

5.1(B) RBF-SNN topology for XOR

5.1(C) RF Configuration for each hidden layer

5.2 Input and Hidden layer Processing

5.3 Supervision using a modified-SHL algorithm

5.4 Output Layer Processing of Centre-Surround Receptive Fields

5.5 XOR Training Error

5.6 XOR Network with Hidden Layer Centre-surround RFs

5.7 XOR Hidden Layer Processing with Centre-Surround Receptive Fields

5.8 XOR Output Layer Processing of Centre-Surround Receptive Fields

5.9 XOR Improved Training Error

## LIST OF TABLES

1.1 XOR Truth Table

1.2 XOR Perceptron Solution Results

2.1 Parameters for the Conductance-based LIF Model

3.1 Supervised learning methods for spike time coding in SNN

5.1 Parameters for the Conductance-based LIF Model

## ABBREVIATION

illustration not visible in this excerpt

## CONTENTS

Declaration

Certificate

Acknowledgement

Abstract

Chapter I: A Review of Neural Network

1.1 Brief History of Artificial Neural Networks

1.1.1 Logical Neural Networks

1.1.2 Early Computer Simulations

1.1.3 The Perceptron

1.1.4 Widrow & Hoff's Delta Rule

1.1.5 The Significance of the XOR Problem

1.2 Summary of Terminology

1.2.1 Activation Functions

1.2.2 Learning

1.3 Backpropagation

1.3.1 Limitations of Backpropagation

1.4 Self-Organising Networks

1.4.1 Competitive Learning

1.5 Radial Basis Function Networks

1.6 Hybrid Topologies

1.6.1 Neuro Evolutionary Systems

1.6.2 Fuzzy Logic Systems

1.6.3 Neuro Fuzzy System

1.7 Summary

Chapter II: Preliminaries of Spiking Neural Networks

2.1 The Biological Neuron

2.1.1 The Role of Ions

2.1.2 Excitatory and Inhibitory Neurons

2.2 3rd Generation Neuron Models (Biophysical and Spiking Neuron Models)

2.2.1 Encoding

2.2.2 Morphological Models

2.2.3 Integrate-and-Fire (IF) Model

2.2.4 Hodgkin-Huxley (HH) Model

2.2.5 Leaky Integrate-and-Fire (LIF) Model

2.2.6 Spike Response Model (SRM)

2.2.7 Conductance-based LIF Model

2.3 Unsupervised Learning

2.3.1 Bienenstock, Cooper and Munro’s Model

2.3.2 Hebbian Learning

2.3.3 Synaptic Scaling

2.3.4 Spike timing-dependent plasticity (STDP)

2.4 Supervised Learning

2.4.1 SpikeProp (Gradient Estimation)

2.4.2 Statistical Approach

2.4.3 Supervised Hebbian Learning (SHL)

2.5 Dynamic Synapse

2.6 Receptive Field

2.7 Spiking Neural Network Versus Traditional Neural Networks

2.8 Summary

Chapter III: Literature Review: Spiking Neural Networks

Chapter IV: Fuzzy based Spiking Neural Networks (FBSNN): A Theoretical Work

4.1 Introduction

4.2 Selection of Neuron Model and Topology

4.3 Solution of XOR by SNN

4.4 Using Fuzzy Reasoning

4.5 Summary

Chapter V: Fuzzy based Spiking Neural Networks (FBSNN): An Implementation and results

5.1 Introduction

5.2 Topology of the Network

5.3 Input Layer with Spikes

5.4 Hidden Layer and Receptive Field

5.5 Output Layer with Modified SHL

5.6 Using Fuzzy Reasoning

5.7 Summary

Conclusion

Future Work

References

## CHAPTER I: A REVIEW OF NEURAL NETWORKS

1. A Review of Neural Networks

1.1 Brief History of Artificial Neural Networks

1.1.1 Logical Neural Networks

1.1.2 Early Computer Simulations

1.1.3 The Perceptron

1.1.4 Widrow & Hoff's Delta Rule

1.1.5 The Significance of the XOR Problem

1.2 Summary of Terminology

1.2.1 Activation Functions

1.2.2 Learning

1.3 Backpropagation

1.3.1 Limitations of Backpropagation

1.4 Self-Organising Networks

1.4.1 Competitive Learning

1.4.2 Self-Organising Map (SOM)

1.5 Radial Basis Function Networks

1.6 Hybrid Topologies

1.6.1 Neuro Evolutionary Systems

1.6.2 Fuzzy Logic Systems

1.6.3 Neuro Fuzzy Systems

1.7 Summary

The puzzle of the human brain is as old as human history itself. How we accumulate and associate day-to-day experiences to build up a “self” remains a mystery in spite of all advances in human knowledge. Making machines that mimic the abilities of the human brain has been a dream for centuries. Designing intelligent machines remained a part of science fiction stories until recently, when computers became popular and the demand for processed data increased. The science of designing intelligent machines is generally referred to as Machine Learning, and the tools that are developed for this purpose are broadly called Neural Networks. There are many definitions of neural networks in the literature. Most of the definitions agree that neural networks contain processing units or nodes (neurons) connected in parallel that are derived from mathematical models of biological neurons found in the brain. The architecture differs from conventional computer architecture in that it incorporates learning rather than programming and processes information in a parallel as opposed to a sequential manner. In practical terms, neural networks are used to find patterns in data. Typically data is presented to a network in the form of inputs and desired outputs, and the network and its training regime are designed to find the input-output relationship or mapping. To illustrate the development of the science, the following section presents a brief history of neural networks.

### 1.1 Brief History of Artificial Neural Networks

Traditional Neural Networks, also called Connectionist Neural Networks or Artificial Neural Networks (ANNs), have in the last two decades proved to be robust problem solvers, outperforming traditional statistical and mathematical tools on a massive scale.

Areas like forecasting, function approximation, pattern recognition, pattern completion, optimization, pattern classification, and clustering have bowed to the sovereign reign of neural networks. The development of these paradigms came about in the following way:

Alexander Bain (1873). The understanding of neural dynamics came forth in the late 19th century, when Alexander Bain [71] published his theory on threshold logic and summation nodes in 1873, based on that period’s recent surge in neuroanatomical findings led by the neurobiologists Gerlach (1858), Nissl (1858), and Waldeyer (1863).

McCulloch and Pitts (1943). Neural computing came about in 1943, when McCulloch and Pitts [63] invented the first artificial neuron, the MP neuron. The MP model was a hard-coded model without learning rules.

Donald Hebb (1949). Donald Hebb [38] set forth his Hebbian learning rule in 1949. The rule simply states that links between a target neuron and its connecting neurons should be strengthened for those neurons that fire around the same time as the target neuron fires. Thus the system learns to react to events which statistical history indicates are of relevance to each other. This rule is especially applicable to the more bio-plausible spiking networks. With the emergence of computers in the 1950s the stage was set for porting the neural models to the digital realm.

Frank Rosenblatt (1958). In 1958 Frank Rosenblatt [86] invented the perceptron, a two-layer network with a learning algorithm.

Bernard Widrow and Ted Hoff (1960). Then, Bernard Widrow [109] and Ted Hoff invented the ADALINE (ADAptive LINear NEuron) in 1960, and later the MADALINE, which was primarily used as a noise filter. These models introduced the then novel least mean square (LMS) error training algorithm, also called the Widrow-Hoff Delta Rule. These early models were interesting but suffered from some limitations: they could do some pattern classification, but could not, for instance, solve a simple XOR problem.

Marvin Minsky and Seymour Papert (1969). The failure to solve the XOR problem led to the famous slaughter of 1969, where Marvin Minsky and Seymour Papert ([67], [68]) single-handedly buried neural network research for the coming decade with their book “Perceptrons: An Introduction to Computational Geometry”, a virtual slugfest on the shortcomings of neural networks.

Teuvo Kohonen (1972). In 1972 Teuvo Kohonen [52] invented the self-organizing map (SOM), which became widely used as a powerful clustering tool. The trick of the SOM is the winner-takes-all training algorithm, which assigns a group of randomly produced vectors to match up with the most prominent clusters of the input data. This is performed by letting the random vector that most closely resembles the current input vector adapt itself towards the current input vector. Applying enough training iterations transforms the random adaptive vectors into an accurate representation of the most prominent vector clusters in the population.

John Hopfield (1982). Interest soared as John Hopfield [42] revived the field of reverberatory neural networks with his Hopfield Neural Network model, which performs successful pattern completion on input with missing data. This algorithm is performed by a clever interlinked cloning of sub-vector parts, where the cloning is done by interlinking weights, so that a partial pattern will trigger the complete pattern’s main components as a result of the partial signal’s interlinked weights producing the rest of the signal.

M. J. D. Powell (1985). Powell [78] invented the Radial-Basis Function (RBF) network in 1985. The RBF became a robust and efficient problem solving paradigm that employs a clustering algorithm to find the most prominent clusters in the input hyperspace of multivariate data and then linearly combine hyperspheres around these clusters to determine the classifications of specific input patterns based on examples of previously classified training patterns.

Rumelhart (1986). But the vast problem-solving power of neural computing did not fully emerge until 1986, when researchers (D. Rumelhart, G. Hinton, R. Williams, Le Cun and D. Parker) [88] were able to extend Widrow & Hoff's delta rule to networks with multiple hidden layers by a training rule known as the generalized delta rule. This model, based on the perceptron, became known as the Back Propagation Network, and it has since been proved that such networks are universal function approximators that can approximate any continuous function or functional, and can thus perform a correct mapping from any set of input variables to the output of the underlying function or functional. The point is that the function or functional itself is hidden in the data, but can be inferred and reproduced by the network given a sufficient number of sample input-output pairs. The underlying mechanism of the Back Propagation paradigm, which solved the problem of training hidden neuron layers, was actually discovered earlier, by Paul Werbos in 1974 and by Parker and LeCun in 1985, but it was Rumelhart et al. who made it universally known.

G. E. Hinton and T. J. Sejnowski (1986). Hinton & Sejnowski produced the Boltzmann machine in 1986. The Boltzmann machine is a Hopfield network derivative, with sophisticated elaborations like the inclusion of annealing, hidden neurons, and stochastic processes. The Boltzmann machine is, like the Hopfield network, used for pattern completion, and can thus indirectly be used for pattern classification of noisy data or data with missing elements.

Jeff Elman (1990). In 1990 Jeff Elman presented the recurrent neural network, which is a feed-forward network modified by one or more feedback connections. The feedback connection sends the input sum of the node back to itself at the next iteration, reduced by a weight factor. A high weight of 0.99, for example, ensures that the effect of an input carries over, with ever-decreasing strength, into the many inputs that follow. This functionality produces a pattern-holding reservoir in time and is employed to capture temporal patterns in time series. At this point in history the neural networks field had matured into a powerful computational field of robust and widely applicable network models, with the major paradigms being:

#### 1.1.1 Logical Neural Networks

The first neural logic circuit was proposed in 1938 by Nicolas Rashevsky [80]. Rashevsky proposed a network to perform the XOR problem (see Figure 1.2), arguing that binary logic was acceptable as a basis for his network since action potentials could be viewed as binary operations. The modern era of neural networks undoubtedly began with the first practical implementation of neural networks using electrical circuits in 1943 by McCulloch & Pitts [63]. McCulloch & Pitts proposed a neuron model that summed its inputs and fired whenever that sum reached or exceeded a threshold. A network of these threshold neurons could implement any logical function. For a neuron with two inputs (see Figure 1.3) and a threshold value of Θ = 2, the logical AND operation is realized.

The figure shows that pairs of binary input data are sequentially fed into the network. The network processes the information and, in all cases, the output of the network reflects the logical AND operation. Similarly, setting a threshold of Θ = 1 defines the inclusive-OR operation. However, by extending the number of input neurons and introducing the concept of a synaptic weight, the network operations cease being logical. Figure 1.4 shows a generic network topology for the McCulloch & Pitts neuron.
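The threshold behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `mp_neuron` is hypothetical, and the unit is assumed to fire when the summed input reaches the threshold (sum ≥ Θ), which is the convention needed for Θ = 2 to give AND on two binary inputs.

```python
def mp_neuron(inputs, threshold):
    """McCulloch & Pitts style unit: sum the binary inputs and fire (1)
    when the sum reaches the threshold (assumed convention: sum >= threshold)."""
    return 1 if sum(inputs) >= threshold else 0

# All pairs of binary inputs, fed to the unit one after the other.
PAIRS = [(0, 0), (0, 1), (1, 0), (1, 1)]

and_outputs = [mp_neuron(pair, threshold=2) for pair in PAIRS]  # threshold 2 -> AND
or_outputs = [mp_neuron(pair, threshold=1) for pair in PAIRS]   # threshold 1 -> inclusive OR

print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]
```

Only the threshold changes between the two functions; there are no weights to adjust, which is the sense in which the model is fixed rather than learned.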

illustration not visible in this excerpt

The model had several drawbacks, namely that it was restricted to binary states, and only operated under a discrete-time assumption. Weights and thresholds are fixed and no interaction takes place between network neurons except in managing signal flow.

#### 1.1.2 Early Computer Simulations

With the advent of the computer, many of the early ideas about artificial neurons could be properly explored. Impetus for many of these early simulations was provided by the ideas of Donald Hebb [38]. Hebb added specificity to the earlier ideas of Bain and James ([6], [44]) by addressing the implications of these ideas for synaptic connections. Hebb proposed that the effectiveness of a synapse between two neurons might be increased by correlated activity of those two neurons. Farley and Clark were the first to simulate a Hebbian-inspired network [28]. The simulations consisted of a network of 32 randomly connected neurons whose connections contained weights and whose outputs implemented McCulloch & Pitts type binary thresholds. The network was organized into four cell assemblies (quadrants), each containing eight cells (neurons). Input data was fed into the two left-hand quadrants, and care was taken to ensure that only one left-hand quadrant was active at any one time. A pattern was deemed to be correctly learned if it produced more activity in one of the right-hand quadrants than in the other. In order for the pattern discrimination to work properly the Hebb rule had to be modified (Farley, 1960). It was discovered that by adding directionality to the rule (i.e. if one neuron is not coincidentally active with another then decrease the weight between them) the experiments had some limited success. However, the topology of the network did not lend itself easily to discriminating between one pattern and another. The topology also drew criticism because the cell assemblies did not form themselves but were connected manually. This was thought to be necessary at the time. In his Ph.D. dissertation, Minsky [67] discussed computational models of reinforcement learning and described his construction of an analog machine composed of components he called SNARCs (Stochastic Neural-Analog Reinforcement Calculators). Farley and Clark described another neural-network learning machine designed to learn by trial and error. In the 1960s the terms “reinforcement” and “reinforcement learning” were used in the engineering literature for the first time. Particularly influential was Minsky's paper “Steps Toward Artificial Intelligence” [65], which discussed several issues relevant to reinforcement learning, including what he called the credit assignment problem.

Hebb’s rule was further championed by early simulations using the IBM Type 704 Electronic Calculator in 1956 [84]. In this work Hebb’s colleague, Peter Milner, was also cited for pointing out the importance of inhibitory connections with regard to the formation of cell assemblies. Milner suggested that the formation of cell assemblies could work if the synaptic weights in the network were excitatory (positive) but that the weights connecting neurons in the cell assemblies should be negative (inhibitory). Milner cited the biology, and the known balance of excitatory and inhibitory synapses, as being significant. The result was partially successful, with cell assemblies forming around the inputs to the network and nowhere else. However, interest in random network topologies and self-forming cell assemblies did not last.

#### 1.1.3 The Perceptron

Perceptrons was the generic name given by the psychologist Frank Rosenblatt to a family of theoretical and experimental artificial neural net models which he proposed in the period 1957-1962. Rosenblatt's work created much excitement, controversy, and interest in neural net models for pattern classification in that period and led to important models abstracted from his work in later years. Currently the names (single-layer) Perceptron and Multilayer Perceptron are used to refer to specific artificial neural network structures based on Rosenblatt's perceptrons. This section outlines the intellectual context preceding Rosenblatt's work and summarizes the basic operations of a simple version of Rosenblatt's perceptrons. It also comments briefly on the developments in this topic since Rosenblatt.

Frank Rosenblatt [86], in 1958, introduced the single-layer perceptron with adjustable synaptic weights and a threshold output neuron. Rosenblatt effectively abandoned the idea of self-forming cell assemblies. He argued that analyzing random neuron topologies in the hope that they would yield something interesting, and studying networks that had been hard-wired to implement a particular function, were both lacking in value. The original Perceptron topology [86] consisted of a layer of n input elements (referred to as a retina), which feed into a layer of m association units, which then connect to a single output neuron (see Figure 1.5).

illustration not visible in this excerpt

Figure 1.5: Rosenblatt’s Perceptron (Rosenblatt, 1958) [86]

The association or predicate units A can implement any function of the input units x. Typically, the Perceptron computes a weighted sum of the inputs, subtracts a bias (b), and passes out one of two possible outputs according to the threshold defined as:

illustration not visible in this excerpt

Rosenblatt introduced an error correction rule (called the Perceptron learning rule) along with a remarkable theorem (the Perceptron convergence theorem) which stated that if two classes are linearly separable the algorithm would converge to a solution in a finite number of iterations. The original perceptron learning procedure only adjusts the weights to the output neuron. The reason for this is that at this time no rationale had been found to adjust the weights connecting the input x and the association units A. Error could be measured at the output but there was no way to evaluate the error originating from the previous layer. This was known as the credit assignment problem. The introduction of the adjustable synaptic weight heralded, for the first time, the exploration of analogue computations in neural networks. The multitude of neural network topologies and the various training algorithms that followed are characterised by the way in which the weight is adapted.
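As an illustration of the error correction rule, the following Python sketch trains a single threshold unit on the linearly separable AND function. The names, the learning rate, the training data, and the exact threshold convention (output 1 when the weighted sum minus the bias is positive, otherwise 0) are assumptions made for this sketch; the threshold equation of the original text is not reproduced in this excerpt.

```python
def predict(weights, bias, x):
    """Threshold unit: weighted sum of the inputs minus a bias, then a hard threshold.
    Assumed convention: output 1 when the net input is positive, else 0."""
    net = sum(w * xi for w, xi in zip(weights, x)) - bias
    return 1 if net > 0 else 0

def train_perceptron(samples, lr=1, max_epochs=100):
    """Error correction rule: weights change only when the output is wrong."""
    weights, bias = [0] * len(samples[0][0]), 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias -= lr * error  # the bias is subtracted in the net input, so it moves the other way
        if mistakes == 0:  # a full pass without errors: the rule has converged
            break
    return weights, bias

# Hypothetical training set: logical AND on binary inputs (linearly separable).
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
print(predictions)  # [0, 0, 0, 1]
```

Because AND is linearly separable, the loop stops after a finite number of passes, in line with the convergence theorem; on a non-separable set such as XOR the same loop would simply run until `max_epochs`.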

#### 1.1.4 Widrow & Hoff's Delta Rule

In 1960, Bernard Widrow and Ted Hoff [109] introduced the Least-Mean-Squares (LMS) algorithm, also known as the Delta rule or the Widrow-Hoff rule. The rule is similar to the Perceptron learning rule with one major difference. Whereas the Perceptron learning rule uses the output of the system after it has been put through the threshold function of Equation 1.3, the Delta rule uses the net weighted output instead. The Delta learning rule was developed to train ADALINE (ADAptive LINear Elements). ADALINE is similar to a Perceptron (see Figure 1.6) except that it uses a linear activation function instead of a threshold.

illustration not visible in this excerpt

The output $y$ for the ADALINE in Figure 1.6 is the net weighted sum of its inputs. During training, for each presented input pattern $x_p$ there will be a desired output pattern $t_p$. The actual output of the network $y_p$ will differ from the target output by an amount $t_p - y_p$. The Delta rule implements an error function based on this difference (error) to modify the weights. As the name LMS suggests, the error function for pattern $p$ is based on the summed-squared error. The total error $E$ is given by:

illustration not visible in this excerpt

Hence the weight update depends on $\delta$, where $\delta$ is the difference between the desired and actual output $(t - y)$.
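A minimal Python sketch of the Delta rule for a single linear unit is given below. It assumes the commonly used form of the update, in which each weight changes by the learning rate times the error (t − y) times the corresponding input, and a total error of half the summed squared differences; the symbol names, the learning rate, and the toy data (generated from the hypothetical target t = 2·x1 − x2 + 0.5) are assumptions of this sketch, not values from the text.

```python
def adaline_output(weights, bias, x):
    """Linear unit: the net weighted sum is used directly, with no threshold."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_delta_rule(samples, lr=0.1, epochs=500):
    """Delta (LMS) rule: after each pattern, move every weight in proportion
    to the error (target minus linear output) and to its own input."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for x, target in samples:
            delta = target - adaline_output(weights, bias, x)
            weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
            bias += lr * delta
    return weights, bias

def total_error(samples, weights, bias):
    """Summed-squared error over all patterns (with the usual factor 1/2)."""
    return 0.5 * sum((t - adaline_output(weights, bias, x)) ** 2 for x, t in samples)

# Hypothetical noise-free data generated from t = 2*x1 - x2 + 0.5.
SAMPLES = [((0.0, 0.0), 0.5), ((0.0, 1.0), -0.5), ((1.0, 0.0), 2.5), ((1.0, 1.0), 1.5)]
weights, bias = train_delta_rule(SAMPLES)
print(weights, bias, total_error(SAMPLES, weights, bias))
```

Unlike the Perceptron rule, the update here is never exactly zero until the linear output matches the target, so the weights keep moving towards the least-squares solution rather than stopping at the first separating line.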

In 1962 Widrow and his students came up with MADALINE (Multiple ADALINE), which utilised multiple ADALINE units [109]. The ADALINE units were still organised in a single layer but were executed in parallel. The multiple ADALINE units combined their outputs in a fixed way and are capable of all logical operations. The versatility of MADALINE made it arguably the most popular network topology of its era. MADALINE and the delta rule were famously used to eliminate echo in long distance telephone lines and satellite transmissions (something they are still used for today), but they have also been used for a whole host of other applications, including adaptive filtering, signal processing, antennas, inverse controls, noise cancellation and seismic signal processing.

#### 1.1.5 The Significance of the XOR Problem

In 1969 Marvin Minsky and Seymour Papert published a book called Perceptrons ([67], [68]), in which they mathematically proved that single-layer perceptrons were only able to distinguish linearly separable classes of patterns. They showed that the single-layer perceptron was unable to solve the XOR problem. The XOR function highlights the perceptron’s inadequacy. The truth table for the XOR problem is shown in Table 1.1.

illustration not visible in this excerpt

Table 1.1: XOR Truth Table

The XOR problem is the simplest linearly non-separable function. It can be thought of as two smaller patterns (the -1,1 and the 1,-1) contained within a larger pattern (the inputs both -1 or 1). Figure 1.7 shows three standard logic functions. The first two, AND and OR, are linearly separable (i.e. a line can be drawn between the two classes); the third, the XOR function, is clearly not.

illustration not visible in this excerpt

Figure 1.7: Separable AND and OR and non-separable XOR

Depending on the presentation order of the data, and by allowing negative weights (Minsky and Papert only used positive weights), it is possible to solve the XOR problem with a single layer perceptron [71]. Nevertheless, the XOR problem does represent an exception to the perceptron convergence theorem for certain orderings of the input data. Minsky & Papert did state that the introduction of a hidden layer neuron would likely solve the problem. They accepted that a perceptron architecture with the addition of a hidden layer neuron would be capable of solving the XOR problem, but conjectured that such a solution would be sterile by virtue of the perceptron losing its inherent simplicity. Such an MLP architecture with one hidden layer neuron is indeed capable of solving the XOR problem. Figure 1.8 (a) shows an example MLP solution [54].

illustration not visible in this excerpt

Figure 1.8: XOR Solution

The addition of a hidden layer neuron means that there are now three inputs to the single output neuron as opposed to two. Therefore the input space has been extended from 2-D to 3-D, with the third dimension being supplied by the output from the hidden layer neuron. The equation for the net input to the hidden layer neuron in the perceptron architecture of Figure 1.8 (a) is given by:

illustration not visible in this excerpt

Table 1.2: XOR Perceptron Solution Results

It can be seen from Table 1.2 that the configuration of the weights, biases and the topology of the MLP do indeed generate the correct mapping for the XOR problem. Further proof of the existence of such a solution is provided by Figure 1.8 (b) [54]. In the figure the coordinates of the original inputs have been extended into 3-D, with the additional coordinate coming from the $net_j$ row of Table 1.2. It can now be seen that there are many possible solutions of the kind shown by the linear plane (shaded in grey in Figure 1.8 (b)). However, there was a penalty incurred by the extension of the single layer perceptron to an MLP. Notice that the input weights $w_1$ and $w_2$ are set to 1. This is because there is no longer a rationale for training these weights, as there is as yet no mechanism to relate these weights to the output error. This is of course the credit assignment issue again, which took nearly twenty years to solve for the MLP. Further, the MLP architecture in Figure 1.8 (a) can only be used for problems that can be solved with input weights equal to one, which of course is not every problem.
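To make the three-input idea concrete, the Python sketch below wires up one possible network of this kind: both inputs feed the output neuron directly with weight 1, and a single hidden neuron supplies the third input. The hidden-to-output weight and the two biases are hypothetical values chosen for this sketch (the actual values of Figure 1.8 (a) and Table 1.2 are not visible in this excerpt), and a bipolar -1/+1 encoding with a sign threshold is assumed.

```python
def sign(net):
    """Bipolar hard threshold (assumed convention: +1 for positive net input, else -1)."""
    return 1 if net > 0 else -1

def xor_net(x1, x2):
    """XOR with one hidden neuron; input weights w1 = w2 = 1 throughout.
    The biases (-1) and the hidden-to-output weight (-2) are hypothetical."""
    hidden = sign(x1 + x2 - 1)               # +1 only when both inputs are +1
    return sign(x1 + x2 - 2 * hidden - 1)    # output sees x1, x2 and the hidden output

PATTERNS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
outputs = [xor_net(x1, x2) for x1, x2 in PATTERNS]
print(outputs)  # [-1, 1, 1, -1]
```

The hidden neuron marks the one corner (+1, +1) that a single line cannot exclude; with that extra coordinate the four patterns become separable by a plane, which is the geometric point made by Figure 1.8 (b).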

Minsky and Papert ([67], [68]) showed other weaknesses of the perceptron, namely that as the size of the network increases, the time to train the network increases rapidly. This scalability issue was of particular interest at a time before the invention of hardware solutions such as accelerator cards. The pace of ever faster hardware devices and a deeper understanding of neural processes have mitigated this problem somewhat, but the scalability issue raised by Minsky and Papert is still very much there. This, coupled with the fact that at the time neural network researchers were competing for funding with researchers in symbolic AI, meant that interest in neural network research virtually collapsed, with only a few 'die-hard' researchers continuing their study with fewer financial funds. Neural network research went on slowly, and as computers became more powerful, neural networks could solve problems that were previously intractable. The next section will begin by summarizing the terminology used so far. Then it will be demonstrated how the credit assignment problem was solved with the introduction of the backpropagation algorithm, and how the multitude of new training algorithms and network topologies rejuvenated neural network research.

### 1.2 Summary of Terminology

In this section, rather than adhering religiously to the chronology of developments of neural networks, only a selection of some of the major developments will be presented. Whilst it is not possible to mention all the developments of the state of the art, the topologies and learning strategies presented are aimed at providing an insight into the diversity of approaches. There will of course be many refinements and other valuable contributions that alas will be excluded from such a cursory examination.

The history described in Section 1.1 has seen the emergence of many network topologies, activation functions, and learning algorithms. Some time will now be taken to ’take stock’ and classify the developments. The early neuron models are similar in function, most having adjustable synaptic weights and biases. Another development was the introduction of the activation function.

#### 1.2.1 Activation Functions

There are many types of activation function. Activation functions in general are employed for various purposes. Linear activation functions for example are often used in output layers (as with the perceptron) to scale the output from hidden layers to the operating range of the desired output. The step (signum or Heaviside) function is used primarily for stability purposes, keeping outputs from successive layers in a network within a certain range. Figure 1.9 shows a sample of some of the most common activation functions.

illustration not visible in this excerpt

Figure 1.9: Various Activation Functions

The plots shown in Figure 1.9 were generated using the built-in activation (transfer) functions of MATLAB’s neural network toolbox [100].
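For readers without access to the toolbox, a few of the common activation functions can be written directly in Python. The selection and the function names below are assumptions made for illustration and do not reproduce the toolbox functions used for Figure 1.9; in particular, the choice that the Heaviside step returns 1 at a net input of exactly 0 is a convention assumed here.

```python
import math

def linear(net):
    """Identity: passes the net input through unchanged."""
    return net

def heaviside(net):
    """Step function with outputs 0 and 1 (assumed to return 1 at net == 0)."""
    return 1.0 if net >= 0 else 0.0

def signum(net):
    """Bipolar step with outputs -1, 0 and +1."""
    return float((net > 0) - (net < 0))

def sigmoid(net):
    """Logistic sigmoid: smooth, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-net))

def tanh(net):
    """Hyperbolic tangent: smooth, bounded between -1 and +1."""
    return math.tanh(net)

# Print each function at a few sample net inputs.
for f in (linear, heaviside, signum, sigmoid, tanh):
    print(f.__name__, [round(f(net), 3) for net in (-2.0, 0.0, 2.0)])
```

The bounded functions keep the output of a layer within a fixed range regardless of the size of the net input, whereas the linear function leaves the range unrestricted.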

#### 1.2.2 Learning

Knowledge is represented in neural networks by the strength of synaptic connections between neurons. Learning is accomplished by adjusting the synaptic weights, hence altering synaptic efficacy. There are three main learning approaches, referred to as supervised, reinforcement and unsupervised learning.

Supervised learning is the most common kind of learning algorithm. It is where input and output pairs of training data are presented to the network. The error during training (the difference between the actual output and the desired output) is used to alter the weights until the desired objective has been achieved. Supervised learning is particularly useful for applications where a specific goal/objective is required.

Reinforcement learning is similar to supervised learning, in that there is some external input into the system that directly affects the altering of synaptic efficacy. However, whereas supervised learning may be viewed as being 'with a teacher', reinforcement is regarded as being 'with a critic'. With supervision, the specific accuracy of the network in learning a desired input-output relationship is used, whereas with reinforcement, the critic only informs the network if the outcome is good or bad.

Unsupervised learning is modification of synaptic weights without any external teacher or critic input. Hebbian-type learning is an example of unsupervised learning where synaptic efficacy is altered as learning progresses in response to correlations of coincidental synaptic response. Other forms of unsupervised learning such as competitive self-organizing learning will be described in later sections.

In addition to the type of learning algorithm employed in training there are also two different ways in which to present the training data. Sequential training (often referred to as incremental training) is where each training sample (input output pair in the case of supervised learning) is presented to the network one after the other. Batch training is where the whole set of training data is presented in an epoch by epoch manner with resulting weight changes calculated using the whole set of training data.
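The difference between the two presentation schemes can be seen in a small Python sketch for a single linear weight trained with a delta-style update. The data, learning rate, and function names are hypothetical, and the batch update is assumed here to sum (rather than average) the per-pattern changes.

```python
def sequential_epoch(w, samples, lr):
    """Sequential (incremental) training: the weight is updated after every sample,
    so later samples in the epoch already see the changed weight."""
    for x, t in samples:
        w += lr * (t - w * x) * x
    return w

def batch_epoch(w, samples, lr):
    """Batch training: all changes are computed with the same weight
    and applied once at the end of the epoch."""
    total_change = sum((t - w * x) * x for x, t in samples)
    return w + lr * total_change

# Hypothetical input-output pairs from t = 2 * x.
SAMPLES = [(1.0, 2.0), (2.0, 4.0)]
w_seq = sequential_epoch(0.0, SAMPLES, lr=0.1)
w_batch = batch_epoch(0.0, SAMPLES, lr=0.1)
print(w_seq, w_batch)
```

After one epoch from the same starting weight the two schemes give different values, because in the sequential case the second sample is evaluated with the weight already changed by the first.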

After the research of Minsky & Papert illustrated the inadequacy of the state of the art, there was a period of disillusionment in neural network research that lasted into the 1980s. Several developments around this time sparked what has been alluded to as being the first neural network revolution. Arguably, there are two main reasons for the rejuvenation of the research area. The first was the work of Hopfield, who introduced a new kind of network topology called the recurrent neural network. The second was the solution of the credit assignment problem for feed-forward networks, which is outlined in the following section.

### 1.3 Backpropagation

Paul Werbos, in his 1974 PhD thesis [106], first demonstrated a method for training MLPs that was essentially identical to Backpropagation, but the work was largely ignored. Independently of Werbos, David Parker [74] in 1982 and Yann LeCun [55] in 1985 both published algorithms similar to Backpropagation, which were again ignored. In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams introduced Backpropagation, also known as the generalized Delta Rule, the first widely recognized learning algorithm for MLPs ([88], [89]).

Backpropagation is a gradient descent supervised learning algorithm. It is sometimes referred to as the generalized delta rule, since it extends the Delta rule, presented in Section 1.1.4 for linear functions, to nonlinear activation functions. The term backpropagation refers to the manner in which the gradient is calculated for nonlinear multilayer networks. The algorithm makes it possible to train all the weights in an MLP network in order to minimize the error. Consider the three-layer MLP network in Figure 1.10.

The manner in which f(net) is evaluated depends on the layer and, in particular, on the activation function used in that layer. Here it is evaluated using the sigmoidal activation function:

Since backpropagation is a supervised training algorithm, learning involves presentation of an input-output training set. Equations 1.17 to 1.21 describe the propagation of the input data through each layer of the network. The resulting network output given by Equation 1.17 is then compared against the desired output, and the difference between the two is the error. This error is backpropagated through the network, the weights and biases are adjusted in proportion to it, and the whole process repeats until the error is zero or, more practically, below an acceptable level. The backpropagation rule is well documented, so its derivation is not reproduced here. Suffice it to say that Equations 1.22 to 1.24 outline the iterative procedure known as backpropagation or the generalized delta rule.

The weight update for the weights connecting the hidden and output layers is expressed in terms of the learning rate η, the hidden neuron output $o_j$ and an error term $\delta_k$. The term $\delta_k$ can be evaluated using the chain rule from the output neuron output $o_k$ and the target or desired output $t_k$. The bias update $\Delta\theta_k$ is evaluated similarly.
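For orientation, the output-layer quantities named above are commonly written as follows. This is a sketch in standard textbook notation, assuming a logistic sigmoid, a sum-of-squares error and a bias that enters the net input additively; the symbols, indices and signs are assumptions and may differ from the numbered equations of this chapter.

$$
f(net) = \frac{1}{1 + e^{-net}}
$$

$$
\Delta w_{jk} = \eta \, \delta_k \, o_j
$$

$$
\delta_k = (t_k - o_k) \, o_k \, (1 - o_k)
$$

$$
\Delta \theta_k = \eta \, \delta_k
$$

Here $w_{jk}$ is the weight from hidden neuron $j$ to output neuron $k$, and the factor $o_k (1 - o_k)$ is the derivative of the logistic sigmoid expressed through its own output.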

Furthermore, the credit assignment problem (see subsection 1.1.3) of determining the error attributable to neurons in previous layers is now solved by recursively calculating the δ terms for those layers, a calculation that is again facilitated by the chain rule. Refer to Rumelhart et al. (1986) for more details.
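The forward pass, the output error term and its recursive propagation to the hidden layer can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in this work: it assumes a logistic sigmoid, a sum-of-squares error, sequential presentation and the XOR task, and the names `train_xor`, `n_hidden`, `eta`, `epochs` and `seed` are hypothetical. The hidden error term is taken as $\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{jk}$, which reduces to a single product here because there is only one output neuron.

```python
import math
import random


def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))


def forward(x, w_ih, b_h, w_ho, b_o):
    """Propagate one input through the hidden layer and the output neuron."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_ih, b_h)]
    out = sigmoid(sum(w * h for w, h in zip(w_ho, hidden)) + b_o)
    return hidden, out


def train_xor(n_hidden=3, eta=0.5, epochs=3000, seed=0):
    """Train a 2-n_hidden-1 MLP on XOR; return the error recorded per epoch."""
    rng = random.Random(seed)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    w_ih = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)]
    b_h = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w_ho = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_o = rng.uniform(-1, 1)
    history = []
    for _ in range(epochs):
        sse = 0.0
        for x, t in data:
            hidden, out = forward(x, w_ih, b_h, w_ho, b_o)
            sse += 0.5 * (t - out) ** 2
            # Output error term (chain rule through the sigmoid).
            delta_o = (t - out) * out * (1.0 - out)
            # Hidden error terms: output delta propagated back through w_ho.
            delta_h = [h * (1.0 - h) * delta_o * w
                       for h, w in zip(hidden, w_ho)]
            # Weight and bias updates.
            w_ho = [w + eta * delta_o * h for w, h in zip(w_ho, hidden)]
            b_o += eta * delta_o
            for j in range(n_hidden):
                w_ih[j] = [w + eta * delta_h[j] * xi
                           for w, xi in zip(w_ih[j], x)]
                b_h[j] += eta * delta_h[j]
        history.append(sse)
    return history


history = train_xor()
```

The returned list holds the summed squared error of each epoch, which is the quantity that would be plotted as a training-error curve.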

#### 1.3.1 Limitations of Backpropagation

The theory behind backpropagation is valid only when the learning rate η is small. However, if η is too small, training will take a long time to converge to a minimum. Conversely, if the learning rate is too large, the learning regime will exhibit oscillatory behavior. The trick is to pick η large enough to produce rapid convergence, but not so large that training oscillates. Another strategy for avoiding oscillatory behavior is to add a momentum term to the weight-change rule, so that each update also includes a fraction of the previous weight change.
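The effect of the learning rate and of a momentum term can be illustrated on a one-dimensional error surface. The sketch below assumes the commonly used form $\Delta w(t) = -\eta \, \partial E / \partial w + \alpha \, \Delta w(t-1)$, where the momentum coefficient α, the error function $E(w) = w^2$ and all numerical values are assumptions chosen for the example.

```python
def grad(w):
    """Gradient of the toy error surface E(w) = w**2."""
    return 2.0 * w


def descend(w0, eta, alpha=0.0, steps=100):
    """Gradient descent with an optional momentum term alpha."""
    w, dw = w0, 0.0
    path = [w]
    for _ in range(steps):
        # New change = gradient step + fraction of the previous change.
        dw = -eta * grad(w) + alpha * dw
        w += dw
        path.append(w)
    return path


slow = descend(1.0, eta=0.01)                  # small eta: slow convergence
unstable = descend(1.0, eta=1.1, steps=10)     # large eta: growing oscillation
with_momentum = descend(1.0, eta=0.01, alpha=0.9)
```

In this toy setting the small learning rate converges slowly, the large one overshoots the minimum with alternating sign, and adding momentum to the small learning rate brings the weight much closer to the minimum in the same number of steps.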

Backpropagation also has a tendency to get stuck in local minima. The addition of the momentum term makes this less likely.

Backpropagation generated much renewed interest in neural network research. David Rumelhart, James (Jay) McClelland and the PDP Research Group, in 1986, published Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volumes 1 and 2 ([88], [89]), and effectively revived neural network research. Terrence Sejnowski and Charles Rosenberg, in 1986, introduced NETtalk (Rosenberg and Sejnowski, 1986), a text-to-speech neural network system trained using Backpropagation and one of the earliest practical applications of MLPs.

### 1.4 Self-Organizing Networks

Research in self-organizing networks began in the late 1960s [5] and early 1970s but the work was largely ignored at a time when interest in neural network research had lapsed. The foundations of self-organizing networks are based in biology (von der Malsburg, 1973) and arose as a consequence of neurobiologists constructing models of behaviour observed in the visual cortex. Previously, it was thought that the organization of neurons in the visual cortex was purely evolved.

#### 1.4.1 Competitive Learning

Perhaps the earliest self-organized networks were based on the idea of competition between neurons. It was intuitively clear that any mechanism that promotes neuron selectivity, and hence diversifies the routing of information in networks, would be beneficial to learning. Rumelhart and Zipser arguably proposed the simplest model of competitive learning in networks [89]. The elements of competitive learning are a set of uniform neurons with randomly distributed synaptic weights, which therefore respond differently to different input patterns, and a learning mechanism that enables the neurons to compete for the right to respond to those input patterns. Figure 1.11 illustrates the simple two-layer topology.

The figure shows that there will be a neuron in the output layer that responds maximally to a particular input pattern, i.e. it will have the maximum activation over the layer.

Figure 1.11: Competitive Learning

Only the output neuron with the maximum activation has the weights connecting it to the input layer incremented. This is referred to as the winner-take-all principle. There are other means of determining the maximum activation, for example calculating the dot product between the inputs and the weights; however, the end result is the same.
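The winner-take-all principle can be sketched as follows. This is a simplified illustration rather than the exact rule of Rumelhart and Zipser: it assumes the winner is the neuron whose weight vector is closest to the input in Euclidean distance, and that the winner's weights are moved a fraction η towards the input. The data, the initial weights and the function names are hypothetical.

```python
def winner(weights, x):
    """Index of the neuron whose weight vector is closest to the input x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))


def train_competitive(data, init_weights, eta=0.1, epochs=20):
    """Winner-take-all learning: only the winning neuron is updated."""
    weights = [list(w) for w in init_weights]
    for _ in range(epochs):
        for x in data:
            k = winner(weights, x)
            # Move the winner's weights a fraction eta towards the input.
            weights[k] = [wi + eta * (xi - wi)
                          for wi, xi in zip(weights[k], x)]
    return weights


# Two hypothetical clusters of input patterns, near (0, 0) and near (1, 1).
data = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1),
        (0.9, 1.0), (1.0, 0.9), (0.9, 0.9)]
init_weights = [(0.4, 0.4), (0.6, 0.6)]
weights = train_competitive(data, init_weights)
```

After training, each neuron's weight vector sits near one cluster, so each neuron responds maximally to one group of input patterns. The initial weights are placed between the clusters here to avoid a neuron that never wins, which is a known weakness of this simple rule.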

### 1.5 Radial Basis Function Networks

RBF networks were developed in the late 1980s as an alternative to MLP networks. An RBF network is a feed-forward network that employs radial basis functions (typically Gaussians) as activation functions. The network consists of an input layer, a hidden layer with non-linear RBF activation functions and a linear output layer usually consisting of a single output neuron. Figure 1.12 shows the topology of a typical RBF network.

The idea of Radial Basis Function (RBF) Networks derives from the theory of function approximation. We have already seen how Multi-Layer Perceptron (MLP) networks with a hidden layer of sigmoidal units can learn to approximate functions. RBF Networks take a slightly different approach. Their main features are:

1. They are two-layer feed-forward networks.

2. The hidden nodes implement a set of radial basis functions (e.g. Gaussian functions).

3. The output nodes implement linear summation functions as in an MLP.

4. The network training is divided into two stages: first the weights from the input to hidden layer are determined, and then the weights from the hidden to output layer.

5. The training/learning is very fast.

6. The networks are very good at interpolation.

The RBF topology circumvents the credit assignment issue of MLP networks (see Section 1.1.5) in that learning is typically implemented in the linear output layer alone. In the standard RBF implementation ([18], [69]) the network output is described formally as:

where y is the output node and $\phi(\lVert x - x_j \rVert)$ is the activity of the hidden node j. The hidden node is centered on the vector $x_j$, x is the input vector and $w_j$ are the weights from the hidden layer to the linear output layer. For Gaussian RBFs, the activation of the hidden node j is given by:

where σ is the width of the Gaussian. The utilisation of a linear output layer is justified by Cover’s theorem [22] on the separability of patterns. The theorem states that a pattern classification problem cast nonlinearly into a higher-dimensional space is more likely to be linearly separable than in a lower-dimensional space. RBF networks are acknowledged to be universal function approximators ([78], [73]), in that an RBF network can approximate arbitrarily well any multivariate continuous function on a compact domain if a sufficient number of radial basis function units are used.
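A minimal sketch of such a network, and of the role of the nonlinear hidden layer, is given below for the XOR problem. It assumes the Gaussian form $\phi = \exp(-\lVert x - x_j \rVert^2 / 2\sigma^2)$, two fixed centers placed by hand, σ = 0.5, no output bias, and output weights trained with the Delta rule; these choices and all function names are assumptions made for the example, not the configuration used in this work.

```python
import math


def gaussian(x, centre, sigma):
    """Gaussian RBF activation for input x and a given centre."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def hidden_layer(x, centres, sigma):
    return [gaussian(x, c, sigma) for c in centres]


def rbf_output(x, centres, sigma, weights):
    """Linear output: weighted sum of the hidden-node activities."""
    phi = hidden_layer(x, centres, sigma)
    return sum(w * p for w, p in zip(weights, phi))


def train_output_weights(data, centres, sigma, eta=0.2, epochs=200):
    """Second training stage: only the linear output weights are learned."""
    weights = [0.0] * len(centres)
    for _ in range(epochs):
        for x, t in data:
            phi = hidden_layer(x, centres, sigma)
            err = t - sum(w * p for w, p in zip(weights, phi))
            weights = [w + eta * err * p for w, p in zip(weights, phi)]
    return weights


xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
centres = [(0, 1), (1, 0)]   # first stage: centres fixed by hand (assumed)
sigma = 0.5
weights = train_output_weights(xor, centres, sigma)
outputs = [rbf_output(x, centres, sigma, weights) for x, _ in xor]
```

Thresholding the outputs at 0.5 separates the two XOR classes, even though only a linear layer was trained. The Gaussian hidden layer has mapped the inputs into a space in which the classes are linearly separable.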

Design methods for RBF networks include the following:

1. Random selection of fixed centers [18]

2. Self-organized selection of centers [69]

3. Regularization RBF [77]

4. RBF configuration using GAs [12]

The methods listed are only a small subset of those available. Whichever approach is used, it is generally acknowledged that the number of RBF nodes should be less than the number of data points in the training set, in order to prevent overfitting and improve generalization. One way to tackle the issue of poor generalization is to choose RBF functions that are ‘case indicators’ [19]. This can be achieved by normalizing Gaussian RBFs. Figure 1.13 illustrates the difference between standard and normalized Gaussian RBFs.

In a normalized RBF network the generalization capabilities are improved. This is because the RBFs cover the entire input space and produce significant outputs even for input vectors that lie far from the RBF centers. In this way, normalized RBF networks are functionally equivalent to fuzzy logic systems (FLSs) ([45], [46]), where the rule-firing strength is also normalized (for FLS background see the following section). Moody & Darken [69] proposed that normalization of RBFs should be performed at the hidden nodes before summation at the output. This is a non-local operation, requiring hidden layer nodes to ‘know’ about the outputs of other hidden nodes, and is therefore potentially at odds with biological plausibility. In contrast, in the approach of Bugmann [19], normalization is performed at the output nodes. Since output nodes already receive input from all the hidden nodes, this preserves the locality requirement of biological plausibility.
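The difference between standard and normalized Gaussian RBFs can be made concrete with a short sketch. It assumes one-dimensional inputs, three hand-picked centers and a common width σ, all of which are hypothetical values; each normalized activation is the standard activation divided by the sum of all activations, computed here in a numerically stable way.

```python
import math


def standard_rbf(x, centres, sigma):
    """Standard Gaussian activations, one per hidden node."""
    return [math.exp(-(x - c) ** 2 / (2.0 * sigma ** 2)) for c in centres]


def normalized_rbf(x, centres, sigma):
    """Activations divided by their sum, so that they always add up to 1."""
    exponents = [-(x - c) ** 2 / (2.0 * sigma ** 2) for c in centres]
    m = max(exponents)   # subtract the maximum to avoid underflow to 0/0
    scaled = [math.exp(e - m) for e in exponents]
    total = sum(scaled)
    return [s / total for s in scaled]


centres = [0.0, 1.0, 2.0]   # hypothetical centres
sigma = 0.3                 # hypothetical common width

far_standard = standard_rbf(5.0, centres, sigma)
far_normalized = normalized_rbf(5.0, centres, sigma)
```

For the input x = 5, which lies far from every center, all standard activations are close to zero, whereas the normalized activations still sum to one and are dominated by the nearest center. This is the sense in which normalized RBFs cover the entire input space.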

### 1.6 Hybrid Topologies

In this section it will be shown how other branches of information processing may be harnessed to improve the capabilities of ANNs. From the multitude of information processing techniques, two popular approaches, namely EAs and FLSs, are selected for discussion. Whilst there are many other techniques that could be discussed, such as parallel computing and belief networks, EAs and FLSs represent two very different paradigms that improve the capabilities of neural networks in very different ways. EAs and FLSs are also employed in this research and are therefore included in this review. These two disciplines span huge research areas in their own right, and reviewing them in detail is outside the scope of this research. Instead, this section will investigate how, together with neural networks, they can be used to create computationally powerful hybrid topologies.

#### 1.6.1 Neuro Evolutionary Systems

At their heart, EAs, whether they are GAs, evolutionary strategies [81] or evolutionary programming systems, are essentially search algorithms based on Darwin’s evolutionary theory [24]. These different schemes of evolutionary modeling were developed independently by many researchers in different parts of the world; the earliest work goes back to the 1950s. The different approaches employ different representations of chromosomes (possible solutions) for varying types of problems. For example, some representations concentrate on the importance of the ordering of events, or on the hierarchy of possible solutions. Nevertheless, stronger research links, increased collaboration between researchers and the development of a mathematical theory of EAs have led to the recognition of a unified approach in which representations, coding schemes, selection, mutation and recombination are chosen to fit the particular problem at hand.

EAs differ from many other optimization techniques, such as simulated annealing [49] and hill climbing [90], in that they are intrinsically parallel. Most optimization algorithms develop one solution at a time, and if they produce an unsatisfactory one they ‘go back to the drawing board’, typically re-initializing randomly generated initial conditions and starting again. EAs, on the other hand, maintain a population of potential solutions to the problem, with multiple offspring capable of searching the solution space in many different directions at once. There are four major components to any EA: parallelism, selection, mutation and recombination. Using the principle of survival of the fittest, these work together to produce optimal or near-optimal solutions. In the same way that a well-adapted species dominates all other species in its surroundings, an optimal solution dominates all other solutions in a solution space. Initially a diverse population of potential solutions is created (exploiting parallelism), casting a ‘net’ over the fitness landscape [62]. To borrow the well-known analogy from Koza [53], this is akin to a ‘team of parachutists dropping onto a problem’s solution space, each one of whom has been given orders to find the highest peak’. Mutation allows each individual to explore its immediate vicinity, whereas selection forces progress (depending on selection pressure), guiding offspring uphill to more promising parts of the solution space (Holland, 1992). Figure 1.14 illustrates this evolutionary cycle.

Figure 1.14: The Evolutionary Cycle
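The evolutionary cycle of population, selection, recombination and mutation can be sketched with a small genetic algorithm. This is a generic illustration, not the EA used in this research: it assumes bit-string chromosomes, a toy fitness function that counts the number of ones, tournament selection, one-point crossover, bit-flip mutation and elitism. All names and parameter values are hypothetical.

```python
import random


def fitness(bits):
    """Toy fitness: the number of ones in the chromosome."""
    return sum(bits)


def tournament(pop, rng, k=3):
    """Selection: the fittest of k randomly chosen individuals."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a, b, rng):
    """Recombination: one-point crossover of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(bits, rng, rate):
    """Mutation: flip each bit independently with probability rate."""
    return [1 - b if rng.random() < rate else b for b in bits]


def evolve(n_bits=20, pop_size=20, generations=40, rate=0.02, seed=1):
    rng = random.Random(seed)
    # Parallelism: a diverse initial population of candidate solutions.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        best_history.append(fitness(best))
        children = [best[:]]   # elitism: keep the current best unchanged
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng),
                              tournament(pop, rng), rng)
            children.append(mutate(child, rng, rate))
        pop = children
    return max(pop, key=fitness), best_history


best, best_history = evolve()
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded per generation never decreases. Without elitism, selection pressure alone would make improvement likely but not guaranteed.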