## N

where $w_i$ are the connection weights and $x_i$ are the inputs, with $x_0$ often chosen to be a constant equal to 1, called the bias. The sgn() function is equal to 1 if its argument is greater than or equal to 0 and $-1$ if it is less than 0. The perceptron is a linear network in that its output "decision" depends on the input only through a weighted linear sum. The perceptron classifies its input, by outputting either a 1 or a $-1$, on the basis of how it projects onto its connection weights. Therefore, the values of the weights determine the perceptron's ability to classify. The connection weights are usually learned via supervised training.
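As a minimal sketch of the classification step just described, the following assumes the output is the sign of the weighted sum of the inputs, with the bias input stored in `x[0]`. The weight values are hypothetical and chosen only for illustration.

```python
def sgn(a):
    """Return 1 if a >= 0, else -1, matching the convention in the text."""
    return 1 if a >= 0 else -1


def perceptron_output(w, x):
    """Classify x by the sign of its projection onto the weights w.

    x[0] is assumed to be the constant bias input, fixed at 1.
    """
    return sgn(sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical weights: w[0] multiplies the bias input x0 = 1.
w = [-1.0, 1.0, 1.0]
print(perceptron_output(w, [1, 1, 1]))  # projection is -1 + 1 + 1 = 1
print(perceptron_output(w, [1, 0, 0]))  # projection is -1
```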

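Training ANNs involves formulating and minimizing an objective function. For the perceptron, the objective function, $H$, is given by

$$H(\mathbf{w}) = -\sum_{i \in S} y_i\, \mathbf{w}^{T}\mathbf{x}_i$$

where $y_i \in \{-1, +1\}$ is the target class of example $\mathbf{x}_i$ and $S$ is the set of examples that are misclassified, given the connection weights. One can see that $y_i\, \mathbf{w}^{T}\mathbf{x}_i < 0$ if $\mathbf{x}_i$ is misclassified, so every term in the sum adds a positive amount to $H$. Thus, the minimum of $H$ is 0, reached when $S$ is empty.

Minimization of the objective function for the training data requires taking the gradient of $H$ with respect to the connection weights and then updating the weights by moving in the direction opposite to the gradient (i.e., gradient descent). This is simply

$$\frac{\partial H(\mathbf{w})}{\partial \mathbf{w}} = -\sum_{i \in S} y_i\, \mathbf{x}_i$$

and the update rule for the perceptron from time step $t$ to $t+1$ becomes

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\, \frac{\partial H(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{w}(t) + \eta \sum_{i \in S} y_i\, \mathbf{x}_i$$

where $\eta > 0$ is the learning rate.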
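The gradient-descent update can be sketched in a few lines. This is an illustration under stated assumptions rather than a definitive implementation: targets are coded as $-1$ or $+1$, each step adds `eta` times the sum of target-weighted inputs over the currently misclassified set, and the logical AND function serves as a hypothetical, linearly separable training set. Examples whose projection is exactly 0 are counted as errors so that learning can start from zero weights.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w, x):
    """sgn of the projection of x onto w (1 if >= 0, else -1)."""
    return 1 if dot(w, x) >= 0 else -1


def train_perceptron(examples, eta=1.0, max_epochs=100):
    """Batch perceptron learning: step against the gradient of H.

    examples: list of (x, y) pairs with x[0] = 1 (bias) and y in {-1, +1}.
    """
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(max_epochs):
        # S: the misclassified set under the current weights.
        S = [(x, y) for x, y in examples if y * dot(w, x) <= 0]
        if not S:
            break  # H has reached its minimum of 0
        for j in range(n):
            w[j] += eta * sum(y * x[j] for x, y in S)
    return w


# Hypothetical training set: logical AND, with the bias input in x[0].
AND_DATA = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]

w = train_perceptron(AND_DATA)
print("learned weights:", w)
print("all examples correct:", all(predict(w, x) == y for x, y in AND_DATA))
```

Because AND is linearly separable, the loop stops once no example is misclassified. On a problem that is not linearly separable it would simply run until `max_epochs` is exhausted.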

Figure 3 Artificial neural network architectures: (a) perceptron, (b) multilayer perceptron.

After learning on the training data, new data are input to the perceptron network, with the goal that it gives correct results on the new data, a property called "generalization." The perceptron generated much excitement over its ability to perform "brainlike" computation. This excitement, however, was squelched in 1969 when Marvin Minsky and Seymour Papert published their book *Perceptrons*, which clearly outlined the computational limitations of the perceptron architecture. The single-layer architecture made it impossible for the network to solve anything but linearly separable problems. A classic example in which the perceptron had difficulty was the exclusive-OR (XOR) problem shown in Fig. 4. Minsky and Papert's seminal work was a blow to the field, which lay largely dormant until the 1980s, when a learning rule for training multilayer perceptrons (MLPs) was formulated.

Figure 4 Linearly versus nonlinearly separable classification. (a) Linearly separable classification problem in $x_1$, $x_2$ feature space. A perceptron can learn the function $f(x)$ for solving linearly separable problems. (b) The exclusive-OR (XOR) problem is not linearly separable. In the XOR problem, data in $x_1$, $x_2$ feature space are distributed in much the same way as the exclusive-OR function in Boolean logic. A perceptron is not capable of learning a function $f(x)$ that classifies all the X's and O's correctly. An MLP, however, can learn $f(x)$ because the output unit combines the decision surfaces generated by all of the hidden units, thereby "piecing together" a complex decision boundary separating all X's from O's.
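The XOR limitation can be illustrated with a small sketch. It assumes binary inputs, targets coded as $-1$ or $+1$, and threshold units throughout. The grid of candidate perceptron weights and the hand-set MLP weights are hypothetical choices made for illustration, not learned values.

```python
from itertools import product


def sgn(a):
    return 1 if a >= 0 else -1


# XOR targets in {-1, +1} for binary inputs (x1, x2).
XOR = {(0, 0): -1, (0, 1): 1, (1, 0): 1, (1, 1): -1}


def perceptron(w, x1, x2):
    """Single unit: w = (bias weight, w1, w2)."""
    return sgn(w[0] + w[1] * x1 + w[2] * x2)


def mlp(x1, x2):
    """Two hidden units plus one output unit, all with hand-set weights."""
    h_or = sgn(x1 + x2 - 0.5)       # +1 when at least one input is on
    h_and = sgn(x1 + x2 - 1.5)      # +1 only when both inputs are on
    return sgn(h_or - h_and - 1.0)  # output unit: "OR but not AND"


def solves_xor(classify):
    return all(classify(x1, x2) == t for (x1, x2), t in XOR.items())


# Coarse grid search over single-perceptron weights: none separates XOR.
grid = [i * 0.5 for i in range(-8, 9)]
perceptron_hits = sum(
    solves_xor(lambda a, b, w=w: perceptron(w, a, b))
    for w in product(grid, repeat=3)
)

print("perceptron weight settings that solve XOR:", perceptron_hits)
print("hand-built MLP solves XOR:", solves_xor(mlp))
```

Each hidden unit draws one line in the input plane, and the output unit combines the two half-planes into the band between them, which is the region where exactly one input is on.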

MLPs, shown in Fig. 3b, consist of several layers of modifiable connection weights. Neural units in layers that are not directly connected to the output are called "hidden units." Connections are usually feedforward, with units in layer $N$ connected to all units in layer $N-1$. The multilayer architecture enables the MLP to solve problems that are not linearly separable. In fact, the eminent Russian mathematician Andrei Kolmogorov proved that any continuous function can be implemented in an MLP with a sufficient number of hidden units.

An important development for the field was the publication in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams of the back-propagation learning rule for training MLPs. The back-propagation learning rule was able to solve the classic credit-assignment problem for MLPs, that is, determining how much each hidden-layer weight contributed to the error at the output. The formulation of the back-propagation learning rule follows the same basic structure as the perceptron learning rule, namely, defining an objective function based on the training error, differentiating the objective function with respect to the connection weights, and then generating a weight update equation based on gradient descent. One of the important differences is that, because for an MLP the outputs are not directly connected to the inputs, the back-propagation learning rule requires the use of the chain rule of calculus to compute the required derivatives.

The publication of the back-propagation algorithm started a flurry of development in the field, with ANNs being built that ranged from networks that could learn to recognize printed characters to networks that could be trained to "talk." For example, the NETtalk network, consisting of 203 input units, 80 hidden units, 26 output units, and a total of 18,629 connections, learned to translate written text into phonemes that could be pronounced by a speech synthesizer. Today, MLP architectures and the back-propagation learning rule continue to evolve, with new variants having better convergence properties, leading to more powerful networks.
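The chain-rule computation at the heart of back-propagation can be sketched for a very small network. The sketch assumes sigmoid units, a squared-error objective, and a single training example. The network size, weight values, and example are hypothetical. A finite-difference estimate is included only as a sanity check on the analytic gradients.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(W1, W2, x):
    """x[0] is the bias input 1. Returns (hidden activations with bias, output)."""
    h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)))
    return h, y


def loss(W1, W2, x, t):
    """Squared-error objective for one training example with target t."""
    return 0.5 * (forward(W1, W2, x)[1] - t) ** 2


def backprop(W1, W2, x, t):
    """Gradients of the loss with respect to W1 and W2 via the chain rule."""
    h, y = forward(W1, W2, x)
    delta_out = (y - t) * y * (1.0 - y)  # error signal at the output unit
    grad_W2 = [delta_out * hi for hi in h]
    grad_W1 = []
    for j, row in enumerate(W1):
        hj = h[j + 1]  # h[0] is the hidden-layer bias
        # Error passed back through W2 to hidden unit j.
        delta_j = delta_out * W2[j + 1] * hj * (1.0 - hj)
        grad_W1.append([delta_j * xi for xi in x])
    return grad_W1, grad_W2


def numeric_grad(W1, W2, x, t, eps=1e-6):
    """Central finite differences, used only to check the analytic gradients."""
    def probe(weights, i):
        orig = weights[i]
        weights[i] = orig + eps
        up = loss(W1, W2, x, t)
        weights[i] = orig - eps
        down = loss(W1, W2, x, t)
        weights[i] = orig
        return (up - down) / (2 * eps)

    g1 = [[probe(row, i) for i in range(len(row))] for row in W1]
    g2 = [probe(W2, i) for i in range(len(W2))]
    return g1, g2


# Hypothetical 2-input, 2-hidden-unit, 1-output network and one example.
W1 = [[0.1, -0.4, 0.3], [-0.2, 0.5, 0.7]]
W2 = [0.05, -0.6, 0.8]
x, t = [1.0, 0.5, -1.0], 1.0

g1, g2 = backprop(W1, W2, x, t)
n1, n2 = numeric_grad(W1, W2, x, t)
max_diff = max(
    [abs(a - b) for ra, rb in zip(g1, n1) for a, b in zip(ra, rb)]
    + [abs(a - b) for a, b in zip(g2, n2)]
)
print("largest analytic vs numeric gradient gap:", max_diff)
```

A gradient-descent step would then subtract a small multiple of `g1` and `g2` from `W1` and `W2`, in the same way as the perceptron update.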

ANNs are not limited to feedforward connection architectures, particularly considering how feedback connectivity can give rise to interesting dynamic properties in biological neural networks. Recurrent networks have an architecture similar to feedforward MLPs, except that there are feedback connections between layers. John Hopfield, motivated by concepts in statistical physics, developed one particularly interesting class of recurrent networks. In a Hopfield network, all neurons are connected to one another; the network is termed "fully connected." Neuron responses are represented as binary states (1/0 or 1/$-1$), analogous to the spin of a particle ($+\tfrac{1}{2}$ or $-\tfrac{1}{2}$) in physics. Interactions between neurons, via their connectivity, influence their individual states much as interactions between particles influence their collective spins. As in the physical system, a network architecture, with a particular set of connections, will have an "energetically favorable" equilibrium state, which can be viewed as an attractor in a dynamic system. Training a Hopfield network involves learning the set of connections that makes a particular binary state of neurons an attractor. Input of a similar pattern, or the original pattern corrupted by noise, results in the network activity eventually converging to the closest attractor, with the output being the stored pattern. Hopfield networks, therefore, implement "associative memories" with properties similar to what has been observed in the hippocampus.
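A minimal sketch of such an associative memory follows. The text does not specify a training rule, so the sketch assumes a Hebbian outer-product rule for the connections, states coded as $-1$ or $+1$, and asynchronous updates. The two stored patterns are hypothetical and were chosen to be orthogonal.

```python
def train_hopfield(patterns):
    """Hebbian storage: W[i][j] = (1/N) * sum of p[i] * p[j], zero diagonal."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W


def energy(W, s):
    """Network energy; stored patterns sit at local minima."""
    n = len(s)
    return -0.5 * sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))


def recall(W, state, max_sweeps=10):
    """Update one unit at a time until no unit changes (an attractor)."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            field = sum(W[i][j] * s[j] for j in range(len(s)))
            new = 1 if field >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


# Two hypothetical, mutually orthogonal patterns over 8 units.
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = train_hopfield([p1, p2])

noisy = list(p1)
noisy[0] = -noisy[0]  # corrupt one unit of the first pattern
restored = recall(W, noisy)

print("recovered stored pattern:", restored == p1)
print("energy before and after:", energy(W, noisy), energy(W, restored))
```

The corrupted input starts at a higher energy than the stored pattern, and the updates move the state downhill until it settles in the attractor.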

Many subtle issues are related to the construction, training, and evaluation of ANNs. For instance, there are questions of how to objectively select the best set of features for input to an ANN and even of how to select the best ANN model itself, e.g., what is the optimal number of layers, neural units, and connection weights given a particular data set and problem. A related issue is the "bias-variance trade-off," which can be thought of as the trade-off between the expected error of the network (the bias) and the variation of the error (the variance) for different subsets of training data. Ideally, one would like to minimize both the bias and variance; however, these two terms tend to move in opposite directions. An overly simple network, with few parameters, will tend to have a large error across training subsets (high bias); however, the value of this error will not vary considerably across training subsets (low variance). An overly complex network will tend to estimate the training data well (low bias); however, this estimate will likely vary considerably across different training subsets (high variance). The challenge is to find the optimal network that minimizes the combination of bias and variance. Readers interested in more detail are referred to Bishop (1995) and Duda, Hart, and Stork (2001).
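The trade-off can be made concrete with a toy simulation. The sketch below does not train networks. As stand-ins it assumes two extreme learners: a constant model that predicts the mean training target (overly simple) and a lookup table that memorizes every noisy target (overly complex). The sine target, noise level, and number of trials are hypothetical.

```python
import math
import random


def bias_variance(fit, xs, f, sigma, trials, rng):
    """Estimate squared bias and variance of a learner, averaged over xs.

    fit(xs, ys) must return a list of predictions at the points xs.
    Each trial draws a fresh noisy training set from the same target f.
    """
    n = len(xs)
    sums = [0.0] * n
    sq_sums = [0.0] * n
    for _ in range(trials):
        ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
        for i, p in enumerate(fit(xs, ys)):
            sums[i] += p
            sq_sums[i] += p * p
    means = [s / trials for s in sums]
    bias2 = sum((m - f(x)) ** 2 for m, x in zip(means, xs)) / n
    var = sum(sq / trials - m * m for sq, m in zip(sq_sums, means)) / n
    return bias2, var


def constant_model(xs, ys):
    """Overly simple stand-in: one parameter, the mean of the targets."""
    mean_y = sum(ys) / len(ys)
    return [mean_y] * len(xs)


def memorizing_model(xs, ys):
    """Overly complex stand-in: reproduces every noisy training target."""
    return list(ys)


def target(x):
    return math.sin(2 * math.pi * x)


xs = [i / 20 for i in range(20)]
rng = random.Random(0)  # fixed seed so the run is repeatable

simple_bias2, simple_var = bias_variance(constant_model, xs, target, 0.3, 500, rng)
complex_bias2, complex_var = bias_variance(memorizing_model, xs, target, 0.3, 500, rng)

print("constant model   bias^2, variance:", simple_bias2, simple_var)
print("memorizing model bias^2, variance:", complex_bias2, complex_var)
```

The constant model shows high bias and low variance, and the memorizing model shows the reverse, which is the pattern the paragraph describes for overly simple and overly complex networks.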

Cognitive neuroscientists have used ANNs as tools for studying brain function, particularly for modeling high-level cognitive processing. Martha Farah and Jim McClelland have built a connectionist model composed of fewer than 200 neurons to study the organization of semantic memory and its role in agnosia, which is a failure in object recognition. They use their model to test two hypotheses, the first being that semantic memory is organized into categories and the alternative being that the organization is based on object properties. For instance, a category-based organization might be based on living versus nonliving things, whereas a property-based system would organize on the basis of the visual and functional attributes of objects. Using a network model based on object-property organization, Farah and McClelland simulated "lesions" in their ANN by deactivating a percentage of the units in the semantic layer. These simulated lesion studies showed that loss of function in the visual and functional semantic units results in a specific categorical agnosia, with lesions to the visual units causing a loss in memory of living things and lesions to functional units resulting in a loss in memory of nonliving things. These results, consistent with the neuropsychological literature, demonstrate that category-specific deficits are an emergent property of a network that is organized based on object properties.

ANNs have been used to solve very complex signal processing problems. The well-known cocktail party problem described earlier is the problem of unmixing individual speakers and other acoustic sources from a set of microphone signals. If little or no information is available about the sources or the environment in which the recordings took place, then the problem is often termed blind source separation (BSS). One technique for BSS that has proven very successful is independent component analysis (ICA). The goal of ICA is to find the mixture components that are most statistically independent. In 1995, Tony Bell and Terry Sejnowski of the Salk Institute showed that ANNs are very good at implementing ICA and, thus, could be used for BSS. Using a single-layer neural network with sigmoidal output units, trained with an entropy-based unsupervised learning rule, they demonstrated impressive results in separating acoustic sources. Since Bell and Sejnowski's original work, many researchers have developed neural network models for BSS applications. In neuroimaging, for example, electroencephalography (EEG) and magnetoencephalography (MEG) are being used to record millisecond-resolution electromagnetic signals related to brain activity. High-density EEG and MEG systems, having more than 100 sensors recording activity across the brain, pick up a mixture of electromagnetic signals originating from different populations of neurons. To analyze these signals, it is often useful to separate them into independent sources, such as separating signals from the auditory cortex from signals originating in the visual cortex. ANNs are being used to recover independent sources in EEG and MEG signals for improved brain activity analysis.
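The entropy-based learning rule is not written out here. As a hedged summary of its commonly cited form, assume a square mixing matrix, logistic output units, and no bias term. The symbols $\mathbf{s}$ (sources), $\mathbf{x}$ (sensor signals), $A$ (mixing matrix), $W$ (network weights), $\mathbf{u}$, and $\mathbf{y}$ are introduced only for this illustration:

$$\mathbf{x} = A\,\mathbf{s}, \qquad \mathbf{u} = W\mathbf{x}, \qquad y_k = \frac{1}{1 + e^{-u_k}}$$

$$\Delta W \;\propto\; \left[W^{T}\right]^{-1} + \left(\mathbf{1} - 2\mathbf{y}\right)\mathbf{x}^{T}$$

The update ascends the entropy of the outputs $\mathbf{y}$. When training succeeds, $W$ approximates $A^{-1}$ up to a reordering and rescaling of its rows, so that each component of $\mathbf{u}$ tracks one of the original sources.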