46
IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 2, NO. I, FEBRUARY 1994
Regular Issue Papers Reinforcement Structureparameter Learning for NeuralNetworkBased Fuzzy Logic Control Systems
ChinTeng Lin and C. S . George Lee
Abstruct This paper proposes a reinforcement neuralnetworkbased fuzzy logic control system (RNNFLCS) for solving various reinforcement learning problems. The proposed RNNFLCS is constructed by integrating two neuralnetworkbased fuzzy logic controllers (NNFLC¡¯s), each of which is a connectionist model with a feedforward multilayered network developed for the realization of a fuzzy logic controller. One NNFLC performs as a fuzzy predictor, and the other as a fuzzy controller. Using the temporal difference prediction method, the fuzzy predictor can predict the external reinforcement signal and provide a more informative internal reinforcement signal to the fuzzy controller. The fuzzy controller performs a stochastic exploratory algorithm to adapt itself according to the internal reinforcement signal. During the learning process, both structure learning and parameter learning are performed simultaneously in the two NNFLC¡¯s using the fuzzy similarity measure. The proposed RNNFLCS can construct a fuzzy logic control and decisionmaking system automatically and dynamically through a reward/penalty signal (i.e., a ¡°good¡± or ¡°bad¡± signal) or through very simple fuzzy information feedback such as ¡°high,¡± ¡°too high,¡± ¡°low,¡± and ¡°too low.¡± The proposed RNNFLCS is best applied to the learning environment, where obtaining exact training data is expensive. The proposed RNNFLCS also preserves the advantages of the original NNFLC, such as the ability to find proper network structure and parameters simultaneously and dynamically and to avoid the rulematching time of the inference engine in the traditional fuzzy logic systems. Computer simulations were conducted to illustrate the performance and applicability of the proposed RNNFLCS.
I. INTRODUCTION
OST of the supervised and unsupervised learning algorithms for neural networks require precise training data sets for setting the link weights and link connectivity of the neurons for various applications [l], [2]. For some realworld applications, precise data for trainingfleaming are usually difficult and expensive, if not impossible, to obtain. For this reason, there has been a growing interest in reinforcement learning algorithms for neural networks [l]. In this
Manuscript received July 16, 1992; revised February 22, 1993. This work was supported in part by the National Science Foundation under Grant CDR 8803017 to the Engineering Research Center for Intelligent Manufacturing Systems and a Grant from the Ford Foundation. C.T. Lin is with the Department of Control Engineering, National ChiaoTung University, Hsinchu, Taiwan, R.O.C. C. S.G. Lee is with the School of Electrical Engineering, h r d u e University, West Lafayette, IN 47907. IEEE Log Number 9213143.
M
paper, we are extending our previous work on neuralnetworkbased fuzzy logic control systems (NNFLC) [3], [4] to the reinforcement learning problem. For the reinforcement learning problem, training data are very rough and coarse, and are just ¡°evaluative¡± as compared with the ¡°instructive¡± feedback in the supervised learning problem. Training a network with this kind of evaluative feedback is called reinforcement learning, and this simple evaluative feedback, called reinforcement signal, is a scalar. In addition to the roughness and noninstructive nature of the reinforcement signal, a more challenging problem to the reinforcement learning is that a reinforcement signal may only be available at a time long after a sequence of actions has occurred. To solve the long timedelay problem, prediction capabilities are necessary in a reinforcement learning system. Reinforcement learning with prediction capabilities is much more useful than the supervised learning schemes in dynamic control problems and artificial intelligence, since the success or failure signal might only be known after a long sequence of control actions. From the biological and cognitive points of view, reinforcement learning is much closer to the modern animal learning theory [ 5 ] than the supervised learning. This is also true to the learning of many highlevel intelligent skills such as how to drive a car. The development of reinforcement learning can be roughly divided into two stages. The first stage began in the 1950¡¯s, when mathematical psychologists developed computational models to explain the learning behavior of animals and human beings [6].They viewed learning as stochastic processes and developed the socalled stochastic learning model. At almost the same time, cyberneticians and control theorists made independent efforts on the study of stochastic learning. Their work basically used deterministic automata as a model for learning systems operating in stationary random environments, and later the model was generalized to use stochastic automata [7]. More details on the stochastic learning automata can be found in [9]. At this stage, most of the learning models were ¡°nonassociative,¡± since there was no input to the learning system except the reinforcement signal. A typical example is the twoarmed bandit problem [8]. Representative of the second stage development of reinforcement learning is the associative reinforcement learning, in which people tried to
10634706/94$04.00 0 1994 IEEE
LIN AND LEE FUZZY LOGIC CONTROL SYSTEMS
41
associate an input pattem with output pattems according to a reinforcement signal. This was stimulated by the theory proposed by Klopf [lo]. Inspired by Klopf¡®s work and earlier simulation results [ 111, Barto and his colleagues used neuronlike adaptive elements to solve difficult learning control problems with only reinforcement signal feedback [ 121. They also proposed the associative reward penalty ( A R  P )algorithm for adaptive elements called ARP elements [ 131, and proposed several generalizations of the ARP algorithm [ 141. Williams formulated the reinforcement learning problem as a gradientfollowing procedure [15], and identified a class of algorithms, called REINFORCE algorithms, that possess the gradient ascent property. However, these algorithms still do not include the full A R  ~ algorithms. In this paper, we shall apply the technique of associative reinforcement learning to our proposed reinforcement NNFLC learning system. The proposed learning system can construct a fuzzy logic control and decision system automatically and dynamically through a reward penalty signal (i.e., good/bad signal) or through very simple fuzzy feedback information such as ¡°high,¡± ¡°too high,¡± ¡°low,¡± and ¡°too low.¡± Moreover, there is a possibility of a long time delay between an action and the resulting reinforcement feedback information. To achieve the goal of solving reinforcement learning problems in fuzzy logic systems, a Reinforcement NeuralNetworkBased Fuzzy Logic Control System (RNNFLCS) is proposed which consists of two closely integrated NNFLC¡¯s. One NNFLC, the action network, is used for the fuzzy logic controller, it can choose a proper action or decision according to the current input vector. Its functions are the same as those proposed in [3], [4], and the major difference is that there is no ¡°teacher¡± to indicate output errors for the action network to learn in the reinforcement learning problem. The other NNFLC, the evaluation network, is used as the fuzzy predictor, and it performs the single or multistep prediction of the scalar extemal reinforcement signal. The fuzzy predictor provides the action network with more informative and beforehand intemal reinforcement signals for learning. Structurally, these two NNE C ¡¯ s share the first two layers of the original NNFLC in [3]; that is, they use the same distributed representation of input pattems. This representation is the overlapping type and is dynamically adjustable through the learning process. Associated with the proposed RNNFLCS is the reinforcement structure/parameter learning algorithm, which uses the temporal difference technique on the evaluation network to decide the output errors for either the single or multistep prediction. With the knowledge of output errors, the online supervised structure/parameter learning algorithm developed in [4] can be applied to train the evaluation network to obtain the proper membership functions and fuzzy logic rules. For the action network, the reinforcement structure/parameterlearning algorithm allows its output nodes to perform stochastic exploration. With the intemal reinforcement signals from the fuzzy predictor, the output nodes of the action network can perform more effective stochastic searches with a higher probability of choosing a good action as well as discovering its output errors. Again, after finding the output errors, the whole action network can be trained by the online learning algorithm described
in [4]. Thus, the proposed reinforcement structure/parameter learning algorithm basically utilizes the techniques of temporal difference, stochastic exploration, and the online supervised structure/parameter learning algorithm [4]. It can determine the proper network size, connections, and parameters of an RNNFLCS dynamically through an extemal reinforcement signal. Moreover, learning can proceed even in the period without any extemal reinforcement feedback. The RNNFLCS also maintains the humanunderstandablestructure of the NNFLC, such that IFTHEN type expert knowledge can be easily incorporated into the fuzzy logic controller or the fuzzy predictor, which is basically a model of the environment or the controlled plant. After learning, the action network becomes an independent fuzzy logic controller which can be used to control the plant in the original environment. In Section 11, the basic structure and functions of our previously proposed NNFLC are described [3], 141. The structure of the proposed NNFLCS and the corresponding reinforcement structure/parameterleaming algorithm with singlestep prediction capability are presented in Section I11 to solve simpler reinforcement learning problems. In Section IV, the multistep fuzzy predictor is proposed to perform multistep prediction in more complex reinforcement learning problems in which there is a long time delay between an action and the resultant reinforcement signal. In Section V, the cartpole balancing problem is simulated to demonstrate the capabilities of the proposed RNNFLCS. Finally, conclusions are summarized in Section VI. 11. NEURALNETWORKBASED FUZZY LOGICCONTROLLER (NNFLC) This section introduces the structure and functions of our previously proposed NeuralNetworkBased Fuzzy Logic Controller (NNFLC) [3], [4], which is a basic component of the proposed RNNFLCS. The learned NNFLC functions as a connectionist neuralnetworkbased fuzzy logic control and decisionmaking system. Fig. 1 shows the basic configuration of a fuzzy logic controller which is composed of three major components: fuzzifier, fuzzy rule base and inference engine, and defuzzifier. The fuzzifier performs the function of fuzzification that converts input data from an observed input space into proper linguistic values of fuzzy sets through predefined input membership functions. The rule base consists of a set of fuzzy logic rules in the form of ¡°&THEN¡± to describe the control policy of expert knowledge. The inference engine is to match the output of the fuzzifier with the fuzzy logic rules and perform fuzzy implication and approximate reasoning to decide a fuzzy control action. Finally, the defuzzifier performs the function of defuzzification to yield a nonfuzzy (crisp) control action from an inferred fuzzy control action through predefined output membership functions. More detailed descriptions of the concepts and definitions of a fuzzy logic controller can be found in [3], [4], [ 161, [ 171. A major problem of designing a fuzzy logic controller is determining the proper input/output membership functions and fuzzy logic rules. Based on the basic structure and concepts of the fuzzy logic controller, an NNFLC with
48
IEEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2. NO. 1, FEBRUARY 1994
X
Fuzzifier
Engine
I
Plant
or outputs
states
I
L
          ¡¯
connections represented by weight values from other units and fanout of connections to other units (see Fig. 3). Associated with the fanin of a unit is an integration function, f, which serves to combine information, activation, or evidence from other nodes. This function provides the net input for this node: net

Fig. 1. General model of a fuzzy logic controller and decision making system.
a connectionist structure has been proposed [3] to realize the traditional fuzzy logic controller with learning abilities. The NNFLC is a feedforward multilayered network that integrates the basic elements and functions of a traditional fuzzy logic controller (e.g., membership functions, fuzzy logic rules, fuzzification, defuzzification, and fuzzy implication) into a connectionist structure which has distributed learning abilities to learn the input/output membership functions and fuzzy logic rules. Fig. 2 shows the structure of our NNFLC, which is described in [3]. The system has five layers. Nodes at layer one are input nodes (linguistic nodes) which represent input linguistic variables. Layer five is the output layer. Nodes at layers two and four are term nodes and act as membership functions to represent the terms of the respective linguistic variable. Actually, a layertwo node can be either a single node that performs a simple membership function (e.g., a triangularshaped function or a bellshaped function) or multilayered nodes (a subneural net) that perform a complex membership function (e.g., in an acoustic cue detector [4]). Hence, the total number of layers in this connectionist model could be more than five. Each node at layer three is a rule node which represents one fuzzy logic rule. Thus, all layerthree nodes form a fuzzy rule base. Links at layers three and four function as a connectionist inference engine [3], [4], which avoids the rulematching process. Layerthree links define the preconditions of the rule nodes, and layerfour links define the consequences of the rule nodes. Therefore, for each rule node, there is at most one link (perhaps none) from some term node of a linguistic node. This is true both for precondition links and consequent links. The links at layers two and five are fully connected between linguistic nodes and their corresponding term nodes. The arrow on the link indicates the normal signal flow direction when this network is in use. We shall later indicate the signal propagation, layer by layer, according to the arrow direction. Signals may flow in the reverse direction in the learning process, as discussed below in Sections 111 and IV. With this fivelayered structure of the proposed connectionist model, a node¡¯s basic functions can be defined. A typical network consists of a unit which has some finite fanin of
input = f ( @ ,U ! , . . . , U:; wf, .. . , w,¡±) (1) tug,
where U: represents an ith input signal from the kth layer, w : represents the ith link weight of the kth layer, the superscript k indicates the layer number, and p represents the number of a node¡¯s input connections. This notation will also be used in the following equations. A second action of each node is to output an activation value as a function of its net input: output = of = a f ) (
(2)
where a(.)denotes the activation function. For example (in standard form),
Y
We shall next describe the functions of the nodes in each of the five layers of the proposed connectionist model. .Layer 1: The nodes in this layer transmit input values directly to the next layer. That is,
f =ut
and
a = f.
(4)
From (4), the link weight at layer one (wd) is unity. .Layer 2 : If we use a single node to perform a simple membership function, then the output function of this node should be this membership function. For example, for a bellshaped function
where m;j and a;j are, respectively, the center (or mean) and the width (or variance) of the bellshaped function of the jth term of the ith input linguistic variable zi. Hence, the link weight at layer two can be interpreted as m ; j . If we use a set of nodes to perform a membership function, then the function of each node can be in the standard form as (3),
LIN AND LEE FUZZY LOGIC CONTROL SYSTEMS
49
r
A
Fig. 2.
F¡¯roposed neuralnetworkbasedfuzzy logic controller (NNFLC)
Layer k

The link weight in layer three (w,¡±) is then unity. Other possibilities for performing the fuzzy AND operation are ¡°product¡± or ¡°softmin¡± operator (a soft version of the min operator [33]). Although these operators require more computations than the min operator, they are differentiable and suitable for the derivation of a learning algorithm. .Layer 4: The links at layer four should perform the fuzzy OR operation to integrate the fired rules which have the same consequent:
P
f
=c u q
i=l
and
a = min(1, f).
(7)
U:
Basic Structure of a Node
Fig. 3. Basic structure of a node in a neural network.
Hence, the link weight wf = 1. .Layer 5: The nodes in this layer transmit the decision signal out of the network. These nodes and the layerfive links attached to them act as the defuzzifier. If mf¡¯¡¯s and u ; ~ ¡¯ s are the centers and the widths of the membership functions, respectively, then the following functions can be used to simulate the center o area defuzzification method [ 161, [ 171: f
and the whole subnet is trained offline to perform the desired membership function by a standard learning algorithm (e.g., backpropagation [2]). .Layer 3: The links in this layer are used to perform precondition matching of fuzzy logic rules. Hence, the rule nodes should perform the fuzzy AND operation
f=
w:~u: = C ( m i j g i j ) u : anda =
~
.
aiju:
(8)
f = min(u7, U:,
. . . , U:)
and
a = f.
(6)
Here the link weight at layer five (w;¡¯) is mijaij. Two complementary learning schemes were proposed to set up the NNFLC in [3] and [4]. The online supervised learning algorithm performs very well when the training data are available online [4], while the twophase hybrid learning algorithm is superior when sets of training data are available
50
[EEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
offline [3 1. However, both learning schemes require precise training data to indicate the exact desired output, and then use the precise training data to compute the output errors for training the whole network. Unfortunately, such detailed and precise training data may be very expensive or even impossible to obtain in some realworld applications because the controlled system may only be able to provide the learning algorithm with a reinforcement signal such as a binary decision of right/wrong of the current controller/decision maker. To train a network with this kind of evaluative feedback, two NNFLC¡¯s need to be integrated into the structure of the proposed RNNFLCS with the corresponding reinforcement learning algorithm developed in the following sections. One NNFLC in the proposed R¡±FLCS functions as a fuzzy controller, and the other NNFLC as a fuzzy predictor. The reinforcement learning algorithm combines the structure learning and the parameter learning to determine optimal centers (mij¡¯s) and widths (aij¡¯s) of the term nodes in layers two and four. At the same time, it will learn fuzzy logic rules by deciding the connection types of the links at layers three and four: that is, the precondition links and consequent links of the rule nodes. All these learning algorithms will be performed on both NNFLC¡¯s simultaneously and only conducted by a reinforcement signal feedback from the extemal environment.
PREDICTOR
Output Membership
RULE
M ATCHINC
111. STRUCTURE/PARAMETER LEARNING ALGORITHM FOR THE R¡±FLCS WITH A SINGLESTEP FUZZY PREDICTOR
Fig. 4. Proposed reinforcement neuralnetworkbased fuzzy logic control system (RNNFLCS).
Unlike the supervised learning problem in which the correct stochastic function of the input produced by the environment ¡°target¡± output values are given for each input pattem to and the output it receives from the network. There are three instruct the network¡¯s learning, the reinforcement learning classes of reinforcement learning problems. First, for the problem has only very simple ¡°evaluative¡± or ¡°critic¡± in simplest case, the reinforcement signal is always the same formation instead of ¡°instructive¡± information available for for a given input/output pair; hence, the network can learn learning. In the extensive case, there is only a single bit of a definite input/output mapping. Moreover, the reinforcement information to indicate whether the output is right or wrong. signals and input patterns do not depend on previous network Fig. 4 shows how a network and its training environment output. For example, the parity learning problem and the interact in a reinforcement learning problem. The environment symmetry learning problem 1181 are in this class. Second, supplies a timevarying vector of input to the network, receives in a stochastic environment, a particular inpudoutput pair its timevarying vector of output/actions, and then provides determines only the probability of positive reinforcement. a timevarying scalar reinforcement signal. In this paper, However, this probability is fixed for each inpudoutput pair the reinforcement signal r ( t ) can be one of the following and, again, the reinforcement signal and input sequence do forms: 1) a twovalued number, r ( t ) E {1, l}, such that not refer to past history. This class includes the nonassociative r ( t ) = 1 means ¡°a success¡± and r ( t ) = 1 means ¡°a failure¡±; reinforcement learning problem, in which there is no input, and 2) a multivalued discrete number in the range [1,1], for we need to determine the best output pattern with the highest example, r ( t ) E {1, 0.5,0,0.5, l } which corresponds to probability of positive reinforcement from only a finite set different discrete degrees of failure or success; or 3) a real of trials. A typical example is the twoarmed bandit problem number, r ( t ) E [1,1], which represents a more detailed and [8]. Third, for the most general case, the environment is itself continuous degree of failure or success. We also assume that govemed by a complicated dynamical process, and both the r ( t ) is the reinforcement signal available at time step t and reinforcement signal and input patterns may depend on the past is caused by the input and actions chosen at time step t  1 network output. For example, in a chess game, the environment or even affected by earlier input and actions. The objective is actually another player, and the network only receives a of learning is to maximize a function of this reinforcement reinforcement signal (win or lose) after a long sequence of signal, such as the expectation of its value on the upcoming moves. time step or the expectation of some integral of its values To resolve the three different classes of reinforcement learnover all future time. ing problems, a new structure, called the reinforcement neuralThe precise computation of the reinforcement signal highly networkbased fuzzy logic control system (€2¡±FLCS), is depends on the nature of the environment and is assumed to be proposed. The proposed RNNFLCS, as shown in Fig. 4, unknown to the learning system. It could be a deterministic or integrates two NNFLC¡¯s into a learning system: one NN
LIN AND LEE: FUZZY LOGIC CONTROL SYSTEMS
51
FLC for the fuzzy controller and the other NNFLC for to be some source of randomness in the manner in which the fuzzy predictor. These two NNFLC¡¯s share the same output actions are chosen by the action network such that the layers 1 and 2 and have individual layer 3 to layer 5 , space of possible output can be explored to find a correct value. which are not clearly shown in the fuzzy predictor in Fig. Thus, the output nodes (layer 5 ) of the action network are now 4. Each network has exactly the same structure as shown in designed to be stochastic units which compute their output as Fig. 2. In other words, the fuzzy controller (action network) a stochastic function of their input. The functions of nodes and the fuzzy predictor (evaluation network) share the same in the other layers of the action network remain unchanged as distributed representation of input states by using the same described in Section 11. Such an approach has also been used in input membership functions (i.e., the same fuzzifier), but they other reinforcement learning algorithms [ 121[15], [ 181[20] have independent fuzzy logic rules (a different rule base and and is consistent with the closely related theory of stochastic decisionmaking process) and different output membership learning automata[9]. functions (a different defuzzifier). The action network can In our learning algorithm, the gradient information, have multiple output as shown in Fig. 2, although only is also estimated by the stochastic exploratory method [19]. one output node is shown in Fig. 4 In the multioutput In particular, the intuitive idea behind the multiparameter . case, all the output nodes of the action network receive distributions suggested by Williams [15] is used for the the same intemal reinforcement signals from the evaluation stochastic search of network output units. In estimating the network. The evaluation network has only one output node gradient information, the output y of the action network does since it is used to predict the extemal scalar reinforcement not directly act on the environment. Instead, it is treated as signal. The action network decides a best action to impose a mean (expected) action. The actual action, y, is chosen onto the environment in the next time step according to the by exploring a range around this mean point. This range current environment status. The evaluation network models the of exploration corresponds to the variance of a probability environment such that it can perform a single or multistep function which is the normal distribution in our design. This prediction of the reinforcement signal that will eventually be amount of exploration, a@), is chosen as: obtained from the environment for the current action chosen by the action network. The predicted reinforcement signal can k k o(t)= [1  tanh(p(t))]= ____) provide the action network beforehand as well as more detailed 2 1+ e2P(t reward/penalty information (¡°intemal reinforcement signals¡±) about the candidate action for the action network to learn and where k is a searchrange scaling constant which can be simply to decrease the uncertainty it faces to speed up the learning. set to 1, and p ( t ) is the predicted (expected) reinforcement In this section, a reinforcement structure/parameterlearning signal used to predict ~ ( t Equation (10) is a monotonic de). algorithm is proposed to solve the first and second classes creasing function between k and 0, and a ( t )can be interpreted of the reinforcement learning problem on the proposed RNN as the extent to which the output node searches for a better FLCS using a singlestep fuzzy predictor. Since the third class action. Since p ( t ) is the expected reward signal, if p ( t ) is of the reinforcement learning problem is more difficult, a small, the exploratory range, a ( t ) ,will be large according to more powerful multistep fuzzy predictor is necessary for the (10). On the contrary, if p ( t ) is large, ~ ( twill be small. ) RNNFLCS. This will be discussed in Section IV. This amounts to narrowing the search about the mean, y ( t ) , if the expected reinforcement signal is large. This can provide a higher probability to choose an actual action, f (.t . , which is ) A . Stochastic Exploration . very ¡®lose to y ( t ) , since it is expected that the mean action y ( t ) In this section, we first develop the learning algorithm for the action network. The goal of the reinforcement strut is very close to the best action possible for the current given ture/parameter learning algorithm is to adjust the parameters input vector. On the other hand, the search range about the is expected reinforcement (e.g., mi¡¯s) of the action network, to change the connectionist mean y ( t ) is broadened small such that the actual action can have a higher probability structure, or even to add new nodes if necessary, such that the of being quite different from the mean action y(t). Thus, if an reinforcement signal is maximum; that is, expected action has a smaller expected reinforcement signal, Or we can have more novel trials. In terms of searching, the use Ami c . ( (9) of multiparameter distributions in the stochastic nodes (the Om; output nodes of the action network) could allow independent To determine we need to know k ,where y is the output control of the location being searched and the breadth of the a? of the action network. (For clarity, we discuss the singleoutput search around that location. In the twoparameter distribution case first.) Since the reinforcement signal does not provide approach, a predicted reinforcement signal is necessary to any hint as to what the right answer should be in terms of decide the search range a ( t ) . This predicted reinforcement a cost function, there is no gradient information. Hence, the signal can be obtained from the fuzzy predictor. If no such gradient $$can only be estimated. If we can estimate then prediction is available, the search range a ( t ) can be set as the online supervised structure/parameter learning algorithm a constant. Then, the multiparameter distribution approach [4] can be directly applied to the action network to solve reduces to the singleparameter distribution approach, which the reinforcement learning problem. To estimate the gradient has been widely used in the reinforcement learning algorithms information in a reinforcement learning network, there needs [11][14]. Once the variance has been decided, the actual
g,
&,
g,
52
IEEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
to perform the structure learning based on the previously proposed fuzzy similarity measures [4] of the output membership (1 1) functions. If structure learning is necessary, then the proposed Y(t> = WY(t)>4 t ) ) . That is, Q(t) is a normal or Gaussian random variable with learning algorithm will further decide whether to add a new output term node (a new membership function); it will also the density function: change the consequenct of some fuzzy logic rules properly. 1 After the structure learning process, the parameter learning will (12) f(Y) = be performed to adjust the current membership functions. This structure/parameter learning will be repeated for each realFor a realworld application, $ ( t )should be properly scaled to time incoming internal reinforcement signal, which appears the final output to fit the input specifications of the controlled either with the same frequency as the external reinforcement plant. This scaling factor or method is applicationoriented. signal (in the singlestep prediction problem) or with much The gradient information is estimated as: higher frequency than the extemal reinforcement signal (in the multistep prediction problem). When the structure/parameter training loop is complete, rule combination [3] is performed to find the minimum node representation of fuzzy logic rules. This is the final step in the process. Before entering the structure/parameter learning loop, as where the subscript t  1 represents the time displacement, shown in Fig. 5, we need to perform two kinds of initialand g is a scaling factor. The time displacements in (13) ization: structure initialization and parameter initialization. In and the following equations reflect the assumption that the the structure initialization, the desired coarse of input fuzzy reinforcement signal (which may be the ¡°predicted¡± reinforce partitions (i.e., the size of the term set of each input linguistic ment signal in the multistep fuzzy predictor) at time step t variable) and the initial guess of output fuzzy partitions must depends on the input and actions chosen at time step t  1. be provided from the outside world. Before this network is In (13), the term is the normalized difference between trained, an initial form of the network is constructed and, the actual and expected actions, T ( t ) is the real reinforcement during the learning process, new nodes may be added and some feedback for the actual action y ( t  l),andp(t) is the predicted connections changed. Finally, after the learning process, some reinforcement signal for the expected action y(t  1). Equation nodes and links of the network will be deleted or combined (13) was derived based on the following intuitive concept. If to form the final structure of the network. In its initial form IT(xi)/ rule nodes with the input ~ ( t>)p ( t ) , then y(t  1) is a better action than the expected (see Fig. 2 ) , there are one, y(t  1),and y(t  1) should be moved closer to c(t  1). of each rule node coming from one possible combination of If ~ ( t < p ( t ) , then Y(t  1) is a worse action than the the terms of input linguistic variables with the constraint that ) expected one, and y ( t  1) should be moved farther away only one term in a term set can be a rule node¡¯s input. Here, from y(t  1). This idea also comes from the observations of IT(xi)l indicates the numberlof terms of xi (i.e., the number a discrete gradient descent method. The concept behind (13) of fuzzy partitions of input state linguistic variable xi). Thus, is frequently adopted in the stochastic exploration techniques the state space is initially divided into lT(x1)l x lT(x2)I x . . . x IT(xn)l linguistically defined nodes (or fuzzy cells) ~91. After the gradient information is available, we have trans which represent the preconditions of fuzzy rules. Furthermore, formed the reinforcement learning problem to the supervised there is only one link between a rule node and an output learning problem and can apply the online supervised struc linguistic variable. This link is connected to a term node of the ture/parameter learning algorithm in [4] to develop the fol output linguistic variable. The initial candidate (term node) of lowing reinforcement structure/parameter learning algorithm the consequent of a rule node can be assigned by an expert for the action network in the proposed RNNFLCS. A de (if possible) or chosen randomly. A suitable term in each tailed derivation and description of the online supervised output linguistic variable¡¯s term set will be chosen for each structure/parameter learning algorithm can be found in [4]. rule node after the learning process. With the initial network One important characteristic of the online supervised struc structure, the parameters in this structure should be initialized. ture/parameter learning algorithm is that it can leam both the The parameter initialization decides the initial membership network structure and parameters simultaneously. Learning the functions of input/output linguistic variables. Theoretically, network structure includes deciding the proper number of out they can be set randomly; however, a more efficient way is to put term nodes in Layer 4 and the proper connections between use identical membership functions such that their domains can the nodes in Layers 3 and 4 of an NNFLC. This learning also cover the region of corresponding input/output space evenly decides the coarse of the output fuzzy partitions and finds according to the given initial coarse of fuzzy partitions. This the correct fuzzy logic rules. Learning the network parameters initilization process is used for both the action network and the includes adjusting the node parameters in Layers 2 and 4. evaluation network, which can be a singlestep or multistep This corresponds to leaming input and output membership fuzzy predictor. functions. The flowchart of this online learning algorithm is After the initialization process, the learning algorithm enters shown in Fig. 5. Given the gradient error information in (13), the training loop in which each loop corresponds to an the proposed learning algorithm first decides whether or not incoming internal reinforcement signal. Basically, the idea of
output of the stochastic node can be set as:
7
ni
LIN AND LEE FUZZY LOGIC CONTROL SYSTEMS
53
hidden nodes. Assuming that w is an adjustable parameter in a node (e.g., the center of a membership function), the general parameter learning rule used is:
Forward Signal Propagation and Stochastic Exploration or Temporal Difference Prediction
dr dw
A W E 
dr
dW
where 7 is the learning rate, and

dr
d(net  input) d r da df __da d f dw ¡¯
d(net  input)  _  f dr d dw df aw
(16)

To show the learning rule, we shall show the computations of layer by layer, starting from the output layer; we use the bellshaped membership functions with centers mi¡¯s and widths oils as the adjustable parameters for these computations. .Layer 5: Using (16), (13), and (8), the adaptive rule of the center m; is derived:
3,
4
Done ?
+r =
Change fuzzy logic rules
Hence, the expected updated amount of the center is:
Parameter adjustment
Similarly, using (6), (13), and (8), the adaptive rule of the width CT~is derived:
I Rule combination I
Fig. 5. Flowchart of the proposed reinforcement structure/parameter learning algorithm.
Hence, the expected updated amount of the width parameter is:
backpropagation [2] is used here to find the errors of node output in every layer except the output layer. These errors are then analyzed by the fuzzy similarity measure to perform structure and/or parameter adjustments. The detailed learning rules are derived. The goal is to maximize the reinforcement signal r ( t ) .For each input vector from the environment, starting at the input nodes, a forward pass computes the activity levels of all the nodes in the network and, at the end, stochastic exploration is performed at the output node to predict Then, starting at the output nodes, a backward pass computes for all the
The error to be propagated to the preceding layer is:
s 5 ( t )=
ar
afa
dr aa y] = = ~ [ r ( t ) p ( t ) ] 
dadf5
[¡±
. (21)
t1
g.
Fuzzy Similarity Measure: In this step, the system decides whether the current structure should be changed according to the exvected uvdated amount of the center and width
54
IEEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
parameters [(18) and (20)]. To do this, the expected center and width are, respectively, computed:
m;(t 1) = minew
ai(t
ainew
+ + 1) =
skip the structure learning process. From the current membership functions of output linguistic variables, we want to find the one which is the most similar to the expected membership function by measuring their fuzzy similarity. The fuzzy similarity measure [4] determines the similarity between two fuzzy sets. If A and B are two fuzzy sets with bellshaped membership functions, then:
a ( t )is a monotonically increasing scalar similarity criterion
such that the lower similarity is allowed in the initial stages of the learning. According to this judgment, degree(i,t) is first compared to the similarity criterion. If there is not enough similarity, then a new term node (new membership function) with the expected parameters is built because, in this case, all the current membership functions are too different from the expected one. A new node with the expected membership function is necessary, and the output connections of some just firing rule nodes should be changed to point to this new term node through the structure learning process. If no new term node is necessary, the learning algorithm will then check if the ith term node is the iclosest node. If this is false, some of the just fired fuzzy logic rules should have the iclosest (term) node instead of the original ith term node as their consequent. In this case, the structure learning process should be performed to change the current structure properly. If the ith term node is the iclosest node, then no structural change is necessary; only the parameter learning should be performed by the standard backpropagation algorithm. The structure learning process is given later. Structure Learning: When entering this process, it means that the ith term node in Layer 4 is improperly assigned as the consequent of some fuzzy logic rules which have just been fired strongly. The more proper consequent for these fuzzy logic rules should be the iclosest node. To find the rules whose consequences should be changed, we set afiring strength threshold p. Only the rules whose firing strengths are higher than this threshold are treated as real frring rules. Only the real firing rules are considered for changing their consequent, since only these rules are fired strongly enough to contribute to the results of judgment. Assuming that the term node M ( m i ,ai) in layer 4 receives input from rule nodes 1... 1 in layer 3, whose corresponding firing strengths are ag's,i = 1. . .1, then: IF ag(t) 2 ,B, THEN change the consequent of the ith rule node from M ( m ; ,a;)to M(m;new, ainew). To utilize the error signal more efficiently, the following fine tuning can be performed. Let
The approximate fuzzy similarity measure of A and B , E ( A , B ) , can be computed as follows: Assuming ml 2 mz,
where ( An B ( indicates the cardinality of A be easily computed from:
n B and it can
where h ( z ) = max(0,x). Let M ( m i ,ai)represent the bellshaped membership function with center mi and width a . Let i
where k = IT(y)l is the size of the fuzzy partition of the output linguistic variable y(t). After the most similar membership function M(miciosest, aiclosest) the expected membership to function M(minew, ainew) has been found, the following adjustment is made: IF degree(i, t ) < a ( t ) , THEN create a new node M(minew, Oinew) in layer 4 and denote this new node as the iclosest node, do the structure learning process, ELSE IF M(miclosest, aiclosest) # M ( m i ,ai) THEN do the structure learning process, ELSE do the following parameter adjustments in layer 5 :
LIN AND LEE: FUZZY LOGIC CONTROL SYSTEMS
55
(Amiclosest)extra = (Agiclosest)extra
1
V¡¯~I (minew  miclosest )
$kl(cinew  ciclosest)
where 0 5 77¡¯ < 0 < 1. The subscript ¡°extra¡± denotes the updated amount in addition to those calculated from other possible error signals. The cutoff value p is chosen empirically. In most cases, a consequent label, which is found to cause a large output error, is usually supported by one or a few strongly fired rules. Hence, it is not too difficult to find a proper p value. A p value in the range [ O S , 0.81 usually leads to good learning results. However, it is possible that several weakly fired rules, which share one consequent label, might affect the output substantially. In this extreme case, the structure learning rule might not be able to change the consequent of these rules properly. This case rarely happens because it means that a set of rules sharing one consequent is assigned to a wrong consequent label simultaneously in the learning process. Three possible methods can be used to solve this problem. First, the p value can be changed adaptively such that it will be lowered when all the rules to a ¡°wrong¡± consequent label are fired weakly. Second, a cutoff is put on the effect of the Layer 4 consequent label rather than using such a cutoff at the Layer 3 output. Third, the cutoff p could be eliminated by weighting the changes caused in a Layer 4 node or by the sum of degrees of rules feeding into it with suitable normalization. This weighting scheme will automatically cause large changes in the important rules only and will do so in a smooth and graceful manner. These suggested modifications on the structure learning algorithm need further studies. *Layer 4: There is no parameter to be adjusted in this layer. Only the error signals (6:¡¯s) need to be computed and propagated. The error signal 6 is derived as in the following: :
In the multioutput case, the computations in layers five and four are exactly the same as the ones using the same internal reinforcement signals and proceed independently for each output linguistic variable. *Layer 3: As in layer four, only the error signals need to be computed. According to (7), this error signal can be derived:

8~ d(net  input)4 d(net  input)4 dui
(33)
Hence, the error signal is 6f(t) = 64(t). If there is more than one output, then the error signal becomes 61(t) = XI, 6;(t), where the summation is performed over the consequences of a rule node; that is, the error of a rule node is the summation of the errors of its consequent. *Layer 2: Using (16) and (9,the adaptive rule of mij is derived: d~ dr  d~ dui dfi  eft 2(u; mij) (34) a?.
dmij
dui d f i d m i j
dui
23
where, from (33),
dT

d~
d(net  input)k
dui
  k d(net  input)k dui
(35)
dT   = 62  dT d(net  input)k df,3
and, from (6), d(net input)k  df3 
64  
d~    = da; dT  d~ dfi dui d fi d(net  input)5 d(net  input)5 dai
Hence,
da;
81 . 9 1
1 0
if = min (inputs of rule node IC); otherwise.
(37)
where, from (8), d(net  input)5  df5 da; dU5
where the summation is performed over the rule nodes that ai feeds into, and and, from (21),
dT  dT  = d(net  input)5 d f 5 qk(t) =
.[VI .
t1
s5 = [ T @ )  p ( t ) ]
if ai is minimum in lcth rule node¡¯s input; otherwise. (39) So, the adaptive rule of mij¡¯s is:
0
{ 6z(t)
Hence, the error signal is:
aij
1
d~
daij
(40)
t1
Similarly, using (16), (3,and (35)(39), the adaptive rule of is derived:
d~ dui d f i dui d f i d c i j
(41)
56
IEEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
Hence, the adaptive rule of
a;j
becomes:
Since the min operator used in layer 3 is nondifferentiable, the learning rule [(37) and (39)] will cause a Dirac delta type of function. This can be avoided if the softmin operator [33] is adopted. In this case, a form of leaky learning will take place at the antecedent level of the network and may lead to a faster learning with a less number of iterations. However, due to more complex computations of the softmin operator, the actual learning time in each iteration will be longer. Hence, for practical considerations, the min operator is selected for performing the fuzzy AND operation in layer 3. The whole learning procedure is summarized by the flowchart in Fig. 5 . The proposed reinforcement learning algorithm provides a novel online scheme to combine the structure learning and the parameter learning such that they can be performed simultaneously. Finally, it should be noted that this backpropagation algorithm can be easily extended to train the membership function implemented by a subneural net instead of a singleterm node in layer two since, from the analysis, the error signal can be propagated to the output node of the subneural net. Then, using a similar backpropagation rule in this subneural net, the parameters in this subneural net can be adjusted.
1
. (42)
t1
predictor is exactly the same as the online supervised learning algorithm proposed in [4] for the NNFLC with a single output node. The goal to train the singlestep fuzzy predictor is to minimize the squared error prediction:
E = 5(T(t) P ( W 2
1
(43)
where ~ ( trepresents the desired output (real extemal rein) forcement signal), and p ( t ) is the current output (predicted reinforcement signal). Then, the gradient information can be easily derived:
(44) aP Similar to the leaming rule developed in the last subsection, we can derive the structure/parameter learning algorithm for the singlestep fuzzy predictor using the following general parameter learning rule:
=p(t)  r(t). dE
w(t
+ 1) = w(t) + 77 (

E)
(45)
where w is the adjustable parameters in the fuzzy predictor. The leaming equations are the same as (16)(42) if is replaced by and the effects caused by this replacet ment are properly updated; that is, all the terms [ ~ ( )
(g)
p(t)]
[e],_,
in (16)(42) are replaced with the term [ ~ ( ) t
P(t)l.
B. SinaleSter, Fuzzv Predictor
We shall use an NNFLC to develop a singlestep fuzzy predictor (evaluation network) as shown in Fig. 4. It shares the same fuzzifier as the action network; that is, both use the same intemal representation, which is an overlapping type of distributed representation of input pattems. The fuzzy predictor receives an extemal reinforcement signal from the environment and produces intemal reinforcement signals to the action network. The function of the singlestep fuzzy predictor is to predict the extemal reinforcement signal, ~ ( t ) one time step ahead; that is, at time t  1. Here, ~ ( tis) the real reinforcement signal resulting from the inputs and actions chosen at time step t  1,but it can only be known at time step t in the first and second classes of the reinforcement leaming problem. If the fuzzy predictor can produce a signal p ( t ) , which is the prediction of r ( t ) but available at time step t  1, then the time delay problem can be solved. With a correct predicted signal p ( t ) , a better action can be chosen by the action network at time step t 1 and the correspondingleaming can be performed on the action network at time step t upon receiving the extemal reinforcement signal T ( t ) .As indicated in the last subsection, p ( t ) is necessary for the stochastic exploration with multiparameter probability distribution (10). The other intemal reinforcement signal, ? ( t ) , Fig. 4 is set as in ?(t)= r ( t ) p ( t ) , which is the prediction error for computing IV. MULTISTEP FUZZY PREDICTOR (13) by the action network. The singlestep prediction is the extreme case of the multistep prediction which will be When both the reinforcement signal and input pattems from presented in the next section. Basically, the training of a single the environment may depend arbitrarily on the past history step predictor is a simple supervised learning problem. Thus, of the network output and the network may only receive a the reinforcement learning algorithm for the singlestep fuzzy reinforcement signal after a long sequence of outputs, the
, ,
In the RNNFLCS, the action and evaluation networks are to be trained together. In principle, we could perform the learning for both networks at the same time since they can be treated individually. Although they share the same fuzzifier, they can find a common input intemal distributed representation (i.e., the same input membership functions) suitable for them. However, since the action network relies on the accurate prediction of the evaluation network, it seems practical to train the fuzzy predictor first, at least partially, or to let the fuzzy ,predictor have a higher leaming rate than the action network. After the consequents of rule nodes are determined for both the action and evaluation networks (i.e., when the structure leaming process is done and the structure will not be changed any further, but the parameter refinement may still be performed), the rule combination in [3], [4] is performed to find the minimum node representation of current fuzzy logic rules. The criteria for a set of rule nodes to be combined into a single rule node are: 1) that they have exactly the same consequents; 2) some preconditions are common to all the rule nodes in this set; and 3) the union of other preconditions of these rule nodes composes the whole term set of some input linguistic variables. If a set of nodes meets these criteria, a new rule node with only the common preconditions can replace this set of rule nodes.
LIN AND LEE: FUZZY LOGIC CONTROL SYSTEMS
51
credit assignment problem becomes severe. This temporal credit assignment problem results because we need to assign credit or blame to each step individually in such a long sequence for an eventual success or failure. Hence, for this class of reinforcement learning problem, we need to solve the temporal credit assignment problem together with the original structure credit assignment problem of attributing network error to different connections or weights. The solution to the temporal credit assignment problem in the RNNFLCS is to design a multistep fuzzy predictor, which can predict the reinforcement signal at each time step within two successive external reinforcement signals which may be separated by many time steps. This multistep fuzzy predictor can assure that both the evaluation network and the action network do not have to wait until the actual outcome is known, and they can update their parameters and structures within the period without any evaluative feedback from the environment. To solve the temporal credit assignment problem, the technique based on the temporal difference methods, which are often closely related to the dynamic programming techniques [22], is used [12], [21]. Unlike the singlestep prediction or the supervised learning method which assigns credit according to the difference between the predicted and actual output, the temporal difference methods assign credit according to the difference between temporally successive predictions. Some important temporal difference equations of three different cases are summarized below. C a s e 1Prediction of final outcome: Given the observationoutcomesequences of the form x1 , 2 2 , . . . , x, , z , where each xt is an input vector available at time step t from the environment and z is the external reinforcement signal available at time step m 1. For each observationoutcome sequence, the fuzzy predictor produces a corresponding sequence of predictions p l r p 2 , . . . , p m , each of which is an estimate of z . Since pt is the output of the evaluation network at time t , p t is a function of the network's input xt and the network's adjustable parameters wt and can be denoted as p(xt,wt), where wt can be mz(t)(center of membership function) or oi(t) (width of membership function). For this prediction problem, the learning rule, which is called TD(X) family of learning procedures, is:
the convergence of TD(0) when pt is a linear function of xt and wt can be found in [21]. C a s e 2Prediction of finite cumulative outcomes: In this case, pt predicts the remaining cumulative cost given the tth observation, x t , rather than the overall cost for the sequence. This case happens when we are more concerned with the sum of future predictions than the prediction of what will happen at a specific future time. Let rt be the actual cost incurred between time steps t  1 and t. Then, pt1 is to predict ztl = E m f l r k . Hence, the prediction error is m+"Ft m+l Zt1  pt1 = C k = t Tk  pt1 = Ck=t (Tk k pk  p k  l ) , where p+ ,l is defined as 0. Thus, the learning rule is:
t1
Awt = q(.t
+ Pt
Pt1)
XtklVwpk. k=l
(47)
.Case 3Prediction of infinite discounted cumulative outcomes: In this case, ptl 'predicts ztl = EEoykrt+rc = rt ypt, where the discountrate parameter y,O 5 y < 1, determines the extent to which we are concerned with short or longrange prediction. This is used for prediction problems in which exact success or failure may never become completely known. In this case, the prediction error is (rt y p t )  p t  1 , and the learning rule is:
+
+
t1
Awt = ~ ( r t ypt  pt1)
k=l
+
XtklVw~k.
(48)
In applying the temporal difference procedures to the proposed RNNFLCS, we let X = 0 due to its efficiency and accuracy [21]. A general learning rule used for the three cases is: Awt = v(rt
+
+ ypt  Pt1)VwPt1
(49)
0 where y, 5 y < 1, is a discountrate parameter and q is the learning rate. We shall next derive the learning rule of the multistep fuzzy predictor according to (49). In this case, p ( t ) is the single output of the fuzzy predictor (evaluation network) for the network's current parameter w(t) and current given input vector x ( t ) at time step t. Here, p ( t ) can be any kind of prediction output in the various cases of the multistep prediction problem stated. According to (49), let:
t1
?(t)= ~ ( t )y p ( t )  p ( t  l), +
0
5 y < 1.
(50)
where p,+l z,O 5 X 5 1, and q is the leaming rate. X is the recency weighting factor with which alternations to the predictions of observation vectors occurring k steps in the past are weighted by Xk. In the extreme case that X = 1, all the proceeding predictions, p l ,p a , . . . ,pt 1, are altered properly according to the current temporal difference, pt  p t  l , to an "equal" extent. In this case, (46) reduces to a supervisedlearning approach and, if pt is a linear function of xt and wt, then it is the same as the WidrowHoff procedure [23]. In the other extreme case that X = 0, the increment of the parameter w is determined only by its effect on the prediction associated with the most recent observation. A theorem about
Then, ?(t) the error signal of the output node of the multistep is fuzzy predictor. The general parameter learning rule, then, is: Aw(t) = q?(t) 
El
tl
where w is the network parameter (i.e., m or a ) The learning i i. rule for each layer in the fuzzy predictor can be computed as in (16)(42). The only exception is that the error signal is different. Thus, the learning equations for the multistep fuzzy predictor are the same as in (16)(42) but with the term [ ~ ( t ) p ( t ) ] [ 5 ] t  l replaced by the term ?(t) in (50). Also, the multistep fuzzy predictor will provide two internal reinforcement signals, the prediction output p ( t ) , and
58
[EEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
the prediction error ?(t)to the action network for its learning (see Fig. 4). The learning algorithm for the action network is the same as that derived in section 111A. However, due to the different nature of the intemal reinforcement signal ?(t ) , the learning algorithm of the action network with the multistep fuzzy predictor will be different. The goal of the action network is to maximize the extemal reinforcement signal r (t).Thus, we need to estimate the gradient information as we did previously. With the intemal reinforcement signals p ( t ) and ?(t),from the evaluation network, the action network can perform the stochastic exploration and learning. The prediction signal p ( t ) is used to decide the variance of the normal distribution function in the stochastic exploration in (10). Then, the actual output y ( t ) can be determined according to (1 1). Since ?(t)is the prediction error, the gradient information is estimated as:
I e
z,
I I I
I
. I
1 
I
I
IX I
Fig. 6. The cartpole balancing system.
dr
t1
In (50), predictionerroris ?(t)= r ( t ) + y p ( t )  p ( t  1 ) = the r(t)[p(t1)yp(t)]. Since p(t1) predicts the accumulated reinforcement signal in the future [i.e., r ( t )+ y p ( t ) ] , p ( t  1)y p ( t ) predicts the next reinforcement signal [i.e., ~ ( t ) ]Thus, . r ( t )is the reinforcement signal with respect to the actual action y(t  l),and [p(t 1) y p ( t ) ] is the reinforcement signal with respect to the expected action y(t 1).Then, from the equation
we can observe that if ~ ( t> [p(t  1)  y p ( t ) ] ,the actual ) action ij(t  1) is better than the expected action y(t  1). So, y(t  1) should be moved closer to y(t 1). On the other side, if r ( t ) < [p(t 1)  ~ p ( t ) ]then the actual action Y ( t  1) is , worse than the expected action y(1  1). So, y(t  1) should be moved further away from Y(t  1). Having the gradient information [(53)], the leaming algorithm of the action network can be determined in the same way as in the previous section. The exact learning equations are the same as in (14)(42), except that [ r ( t ) p ( t ) ] has been replaced by the new error term [ r ( t ) y p ( t )  p ( t 
1)1 tl¡® Until now, we have developed the reinforcement learning algorithm for the action network with the multistep fuzzy predictor in the multistep prediction problem. The issues of learning rate and learning order for both the action and evaluation networks and the final step to find the minimum node representation of fuzzy logic rules using the rule combination technique are the same as discussed in Section 111. One interesting advantage of using the NNFLC as a predictor is attributed to the highlevel, humanunderstandable meaning of the NNFLC [3],[4]. When the RNNFLCS works in a learning environment, the fuzzy predictor attempts to model the statusreaction relation of the environment, and this relation can be interpreted as the ¡°IFTHEN¡± rules on the input/output linguistic variables. For example, the interpretation may resemble ¡°IF the load is a little light and the power is
[e]
+
[G]
t1
very high, THEN the machine is easily overrunning.¡± Using NNFLC as a predictor is more interesting when there are human factors in the learning environment. For example, if we use the RNNFLCS to design an air conditioner, then the fuzzy predictor can model the user¡¯s sensitivity of feeling like this: ¡°IF the temperature is warm and the humidity is high, THEN I feel a little uncomfortable.¡± Moreover, the learned fuzzy predictor can tell us exactly what the user means by ¡°warm,¡± ¡°high humidity,¡± and ¡°a little uncomfortable¡± via the learned inpudoutput membership function, though such feelings may vary from person to person. A classic application of the reinforcement learning is in game theory where the ¡°environment¡± is another player or players. For example, considering the RNNFLCS for a chess game, the fuzzy predictor can leam its opponent¡¯s skill by explaining its opponent¡¯s thinking rules. The humanunderstandable meaning of the NNFLC can also easily make the RNNFLCS incorporate existent or obvious expert knowledge. This not only benefits the design of a fuzzy logic controller, but is also a great help to the fuzzy predictor in some applications. Although the membership functions are always difficult to find, the fuzzy logic rules may be obvious in some cases. Considering the air conditioner example, the rules of users¡¯ reactions to a room¡¯s status are much alike, but the standards of the feeling for highbow temperature may be quite different. In such a case, we can set up the structure of the fuzzy predictor manually according to the known fuzzy logic rules and let the input/output membership functions leam to fit different users.
V. AN ILLUSTRATIVE EXAMPLE
The proposed RNNFLCS with multistep fuzzy predictor has been simulated on a Sun SPARC station for the cartpole balancing problem or the socalled inverted pendulumbalancing problem. This problem is often used as an example of inherently unstable and dynamic systems to demonstrate both modern and classic control techniques [24], [25], as well as the learning control techniques of neural networks using supervised learning methods [26] or reinforcement learning methods [12], [27], [28]. As shown in Fig. 6, the cartpole balancing problem is the problem of learning how to balance an upright pole. The
LIN AND LEE: FUZZY LOGIC CONTROL SYSTEMS
59
bottom of the pole is hinged to a cart that travels along a finitelength track to its right or left. Both the cart and the pole can move only in the vertical plane; that is, each has only one degree of freedom. There are four input state variables in this system: 0, the. angle of the pole from an upright position (in degrees); 19, the angular velocity of the pole (in degrees/second); Z, the horizontal position of the cart's center (in meters); and k , the velocity of the cart (in d s ) . The only control action is f , which is the amount of force ( N ) applied to the cart to move it toward its left or right. The system fails and receives a penalty signal of 1 when the pole falls past a certain angle (* 12 degrees is used here) or the cart runs into the bounds of its track (the distance is 2.4m from the center to both bounds of the track). The goal of this control problem is to train the RNNFLCS such that it can determine a sequence of forces with proper magnitudes to apply to the cart to balance the pole for as long as possible without failure. The model and corresponding parameters of the cartpole balancing system for our computer simulation are adopted from [29] with additional consideration of friction effects. This model and its parameters are also used by Barto, Sutton, and Anderson [12], [28]. The equations of motion that we used are shown below, where 1) g = 9.8m/s2, acceleration due to the gravity 2) m = 1.1 kg, combined mass of the pole and the cart 3) mp = 0.1 kg, mass of the pole 4) I = 0.5 m, halfpole length 5) pc = 0.0005, coefficient of friction of the cart on the track 6 ) p p = 0.000002, coefficient of friction of the pole on the Cart 7) A = 0.02, sampling interval The constraints on the variables are 12' 5 O 5 12",2.4m 5 x 5 2.4m, and lON 5 f 5 ION. In designing the controller, the equations of motion of the cartpole balancing system are assumed to be unknown to the controller. A more challenging part of this problem is that the only available feedback is a failure signal that notifies the controller when a failure occurs; that is, either 181 > 12" or 1x1 > 2.4m. This is a typical reinformation learning problem, and the feedback failure signal serves as the reinforcement signal. Since a reinforcement signal may only be available after a long sequence of time steps in this failure avoidance
INPUT/oUTPUT MEMBERSHIP
TABLE I FUNCTIONS BEFORE AND
&TER
LEARNING
U
1
1 1
4.4
1
X
1 2 0
0.0
2 . . 4
I
I
2.6
I
1.5
9fi
I
1.YX
0.13
917
I
121 ,.8
I
4 13 17.21 7.39 16.01 7.39 4.81 212 1.92 1.89 3 x7 8.01 84.62 39.42
~ ~~
2
1
2
20.0 0.0 20.0
20.0 10.0 20.0
0
e
1 2
12.0 6.0
3.0 0.0 3.0
fin .
6.0 0.0 3.0
~~
3 1 4 1
I I
I
3.0 3.0
fin 6.0
I I
I
15.42 0.21 17.25 10.22 5.01 1.82 0.21 2.19
~ ~~
I I
I
I
5
1
,531
6
0 1 2
I
e
I
1
12.0 100.0 0.0
I
I
I
I
100.0 50.0
I
1no.n
I
I
8.82 84.29 0.32
~ ~~
I
I
I
1
task, this cartpole balancing problem belongs to the third class of the reinforcement learning problems discussed in Section 111. Thus, a multistep fuzzy predictor is required for the RNNFLCS. Moreover, since the goal is to avoid failure for as long as possible, there is no exact success in finite time. Also, we hope that the RNNFLCS can balance the pole for as long as possible for infinite trials, not just for one particular trial, where a "trial" is defined as the time steps from an initial state to a failure. Hence, this cartpole balancing problem is further categorized as the third case of the multistep prediction problem discussed in Section IV, and (49) must be used for the temporal difference prediction method. The reinforcement signal is defined as:
T(t)
=
{0 1
if IO(t)l > 12" or Jz(t)l> 2 . 4 ~ ~ ; (55) otherwise
and the goal is to maximize the sum y ' ~ ( t k), where y is the discount rate. In our computer simulation, the learning system was tested for ten runs by trying to use the same learning parameter values in [12]. Each run consisted of a sequence of trials; each trial
+
O(t + 1) = O(t) + Ae(t)
(54)
e(t + 1) = e ( t ) + A
mgsinB(t)  c o s O ( t ) [ f ( t )
+ m,Z(e(t)~/180)~sinO(t) pcsgn(j.(t))]  ppgpe/t) (4/3)mZ  mpZ cos2 O ( t )
z ( t + 1) = ~ ( t )A k ( t ) , +
k(t
sinO(t) + 1) = k ( t ) + A f ( t ) + mpl[(~(t)7r/180)2 m j(t)~/180cosO(t)] p,sgn[i(t)]

60
IEEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
began with the same initial condition O(0) = e(0) = z(0) = i ( 0 ) = 0 or with a randomized initial condition, and ended with a failure signal indicating that either l e ( t ) ( > 12¡± or Iz(t)l > 2.4m. The randomized initial condition means that, after each failure, the initial configuration was independently and randomly chosen such that 10 < O(0) < 10,50 < O(0) < 50, 2 < z(0) < 2, and 10 < i ( 0 ) < 10. The input ( fuzzy partitions were set as IT(z)I = 3 , ( T ( i ) = 3, IT(O)l = 7, and IT(0)l = 3 for all runs. For each run, the input (output) membership functions were initialized so that they covered the whole input (output) space evenly, and the output fuzzy partition was initialized as IT(f)l= 7. The membership functions were chosen to be the bellshaped functions (5) with initial centers and widths shown in Table I. Also, in the initiation of each run, each rule was assigned with a consequent term randomly. There is a total of 189 rules in the beginning. Here, we assumed that no expert knowledge (in the form of IFTHEN rules) is available to this control problem. If expert knowledge is available in advance, then the initial rules can be set up correctly. In this case, we can even skip the structure learning portion in our learning algorithm, and this will greatly shorten the learning time. A study on this issue has been reported in [4]. Runs consisted of at most 50 trials, unless the duration of each run exceeded 500 000 time steps. The results of two ¡°trials¡± are shown in Fig. 7. A run was successful and terminated after 500 000 time steps before all 50 trials took place; otherwise, it was called ¡°a failure¡± and terminated at the end of its 50th trial. The two trials shown in Fig. 7 are two failure trials. The status of system parameters in the cartpole balancing problem is also shown in this figure. The first example in Fig. 7 is a trial in the early stage of a ¡°run,¡± so it failed in only about 450 time steps. From the figure showing the deviation angle, we can see that this trial failed because the pole fell outside the desired range (the pole¡¯s deviation angle is greater than 12¡±). At this stage, the FLC had learned to some extent in the area around the zero state of the input space since the FLC did try to decrease the pole¡¯s deviation angle before the failure occurred (see the figure showing the force). However, it still needs extensive learning in the other portion of the input space. The second example in Fig. 7 is a trial in the later stage of a ¡°run,¡± so it keeps the pole in the desired position longer. The system failed at around the 2800th time steps because the cart bumped into the right wall. At this stage, the FLC had learned quite well in some areas close to the zero state of the input space and it could control the pole in the correct position well. However, more leaming is still necessary for the FLC to control the position of the cart. It seems that more training data from some ¡°desired¡± regions of the input space (especially training data corresponding to the far ends of the track) were necessary for further training of the FLC. In our computer simulations, a total of ten runs were performed. Among these ten runs, five of them started with zero initial conditions and the others started with randomized initial conditions. The simulation results (see Fig. 8) showed that the RNNFLCS can leam to balance the pole within 20 trials. In most of the ten runs, the leaming was completed before ten trials. It was observed that the runs starting with
x o 1
fq hgl
0
100
m
300
400
Time steps
2 0
10
100
200
ux)
400
0
100
m
300
400
Time fteps
rn sreps ue
I
¡®$1
NNFLCSon the cartpole
Fig. 7. Two trials in the leaming process of the balancing problem.
a randomized initial condition usually took more trials. Fig.
9 shows the angle deviation of the pole about the center
point when the cartpole system was controlled by a welltrained RNNFLCS. This performance is better than the results presented in [12], [28] and compatible to those in 1201. In most runs, the final number of leamed output membership functions is less than 15, as compared to 189 output membership functions that were used in [20] for each run; that is, one output membership function for each (overlapping) grid of input space. One learned fuzzy logic controller (action network) in a run is shown in Fig. 10. It has a total of 35 fuzzy logic rules. Each node in the third layer corresponds to one rule; the links to and from a rule node specify the preconditions and conclusion of the corresponding rule. The parameters of the leamed input membership functions in the fuzzifier (the second layer of the action network), and the learned output membership functions in the defuzzifier (the fourth layer of the action network) are tabulated in Table I. Similar to the simulations in [20], the adaptation capability of the proposed RNNFLCS was tested. In a series of tests, the proposed RNNFLCS was required to adapt to changes in the length and mass of the pole. Six tests were performed. The first two tests were to increase the original mass of the pole by a factor of 10 and 20, respectively. In the next two tests, the original pole was shortened, respectively, to 213 and 113 of the
LIN AND LEE: FUZZY LOGIC CONTROL SYSTEMS
61
4 1
Angk 0
Tum sep
2
I
I
I
t
4
I
I
I
I
200
50
I00
150
The (&)
Fig. 9. Variations in B produced by the learned RNNFLCS.
Time r e p
Fig. 7. Continued.
5
10
I5
Trial\
Fig. 8. Performance of the RNNFLCS on the cartpole balancing problem.
original pole length. The last two tests were to simultaneously reduce the length and weight of the pole to 213 and 113 of its original values, respectively. We found that the RNNFLCS with pretrained knowledge was able to complete the control task of balancing the pole without further trainindearning. In addition to keeping the capability that learning can perform at each time step within a trial without waiting to know the actual outcome by using the multistep fuzzy predictor, the proposed RNNFLCS has some important features over the previous work. First, distributed representation is used to represent the input vectors in the RNNFLCS. This is achieved by the fuzzification process through the
adaptive input membership functions. With the adaptive input membership functions, the input space can be considered to be divided into overlapping smaller regions and, more importantly, this partition is not performed in advance but is dynamically and appropriately adjusted during the learning process. As a result, each smaller region varies in size and the degree of overlapping is also adjustable. This is in contrast to the localized storage scheme of the BOXES system [12], [27], in which the input space is divided into disjoint regions (boxes) and no generalization occurs beyond the confines of a given box. The fuzzification process in the RNNFLCS also avoids the necessity of partitioning the input space into disjoint [12], [27] or overlapping [20], [30] small regions in advance. The drawback of this process is that it requires the learning control designer to know how to quantify or partition the input space properly according to the knowledge of the control task; otherwise, the input space can be easily partitioned in such a way that the controller will not work at all. In most cases, a priori selection of a sufficiently fine representation is required. However, a representation that is too fine may result in poor generalization capabilities and unnecessarily slow learning. The second most important feature of the proposed RNNFLCS is its dynamic structurelparameter learning ability and the rule combination process. The former can add new nodes to the RNNFLCS properly, while the latter can combine some nodes with the same or a very similar control action. These functions realize the concept of ¡°splitting¡± and ¡°lumping¡± boxes in [31], where the authors tried to improve their BOXES system without much success. The third feature of the proposed RNNFLCS is its stochastic search with multiparameter distribution functions. This results in the action network having a higher probability of finding a better action via the prediction signal from the evaluation network instead of performing random searches around the expected action using only singleparameter distribution functions. The fourth important feature of the proposed RNNFLCS is the ease of incorporating expert knowledge into the fuzzy logic controller or the fuzzy predictor of the RNNFLCS to greatly shorten the learning time. The continuous control scheme is another important feature of the RNNFLCS. By performing fuzzy inference and defuzzification, the action network of the RNN

62
IEEE TRANSACTIONS ON FUZZY SYSYEMS, VOL. 2, NO. 1, FEBRUARY 1994
Layer 5
(Output
r
Ilnguistl nodes)
Layer 4
(Output
term nodes)
Layer 3
(rule
nodes)
Layer 2 (Input
term node51
Layer I (Inout
tingulstl
nodes)
Fig. 10. The learned fuzzy logic controller on the cartpole balancing problem.
FLCS determines an analog output control signal. This is in contrast to the bangbang control scheme used in [12], [27], Thus, [28], where the output can only have two values: &lON. the angle deviation of the pole about the center point, as shown in Fig. 9, is rather small in our computer simulation. Recently, Berenji and Khedkar [32], [33] independently proposed a FLC structure and its associated learning algorithm which are similar to what we propose here. They constructed a model, called Generalized Approximate Reasoningbased Intelligent Control (GARIC) architecture, for learning and tuning a fuzzy logic controller based on reinforcement signals from a dynamic system. Their architecture extends Anderson¡¯s method [28] by including apriori control knowledge of expert operators in terms of fuzzy control rules. In GARIC, a threelayer feedforward neural network, called Action Evaluation Network (AEN), was used to predict the external reinforcement signals and produce internal reinforcement signals. The role of the AEN can be considered to be parallel to our fuzzy predictor in the RNNFLCS. They also adopted a fivelayer feedforward neural network, called Action Selection Network (ASN), to implement a FLC as we did. Moreover, a stochastic action modifier was used to stochastically generate an actual action according to the expected action suggested by the ASN and the internal reinforcement signals from the AEN. The ASN, along with the stochastic action modifier, play the same role as the action network in our RNNFLCS. Hence, there are two major differences between the structures of RNNFLCS and GARIC. First, the RNNFLCS uses a fuzzy predictor while the GARIC uses a general multilayer neural network for the reinforcement signal prediction. Second, in the RNNFLCS, the action network (fuzzy controller) and the evaluation network (fuzzy predictor) share the same internal input representation. However, in GARIC, the corresponding two networks, the ASN and AEN, are structured independently. The learning algorithm in the AEN of GARIC used Sutton¡¯s AHC algorithm
[12], [28] for the output units and the error backpropagation algorithm for the hidden units. The stochastic exploration technique and the error backpropagation algorithm were used to find the proper parameters in the ASN. This corresponds to the adjustment of the input/output membership functions of a fuzzy logic controller. Although these techniques are also adopted in the RNNFLCS, our learning algorithms for the evaluation network and the action network are developed based on our previously proposed online structure/parameter learning algorithm [4]. Hence, in addition to the parameter learning, they can perform the structure learning and find the proper fuzzy logic rules dynamically. This can be considered the major difference between RNNFLCS and GARIC. Furthermore, we have developed both the singlestep and multistep fuzzy predictors to incorporate three different cases of reinforcement learning problems. This makes the RNNFLCS a much more general structure than the GARIC. The learning speed of the GARIC for the cartpole balancing problem was faster than that of the RNNFLCS. However, a set of correct fuzzy logic rules must be given in advance by experts before initiating training of the GARIC. VI. CONCLUSION
This paper described the development of integrating two NNFLC¡¯s into an integrated Reinforcement NeuralNetworkBased Fuzzy Logic Control System for solving various reinforcement learning problems. By combining the techniques of temporal difference, stochastic exploration, and the previously proposed online supervised structure/parameter learning algorithm, a reinforcement structure/parameter learning algorithm was derived for the RNNFLCS. Using the proposed connectionist structure and learning algorithm, a fuzzy logic controller to control a plant and a fuzzy predictor to model the plant can be set up dynamically through simultaneous structure/parameter learning for various classes of
LIN AND LEE: FUZZY LOGIC CONTROL SYSTEMS
63
reinforcement learning problems. The proposed RNNFLCS makes the design of fuzzy logic controllers more practical for realworld applications since it greatly lessens the quality and quantity requirements of the feedback training signals. Computer simulations of the cartpole balancing problem satisfactorily verified the validity and performance of the proposed reinforcement structure/parameterlearning algorithm for RNNFLCS. Future work will focus on an interesting problem of regulating the cart¡¯s position from an initial position to a set point, in addition to keeping the pole balance. This problem has multiple goals with different priorities. A hierarchical FLC design approach is needed. This is a topic of our ongoing research which will be discussed in a future paper.
ACKNOWLEDGMENT
The authors would like to thank the reviewers for their helpful suggestions in improving the quality of the final manuscript.
REFERENCES
1. G . E. Hinton, ¡®¡®Connectionist learning procedures,¡± Art. I n t e k vol. 40 no. 1 pp 143150, 1989. 2. ¡¯D. E.¡¯Ruhelhart, G . E, Hinton, and R, Williams, internal representations by err& propagation,¡± Parallel Distrib. Proces. Cambridge, MA: M.I.T. Press 1986, vol. 1, pp. 318362. and ~ , 3, C, T, L~~ C, S, G, L ~¡°NeuralnetwoTkbased fuzzy logic control and decision swtem,¡± IEEE Trans. Comout.. vol. C40. no. 12. DD. 13201336, Dec. 1991. 4. C. T. Lin and C. S. G . Lee, ¡°Realtime supervised structure/parameter learning for Fuzzy Neural Network,¡± in Proc. IEEE Int. Con$ Fuzzy Syst., San Diego, CA, Mar. 1992, pp. 12831290. 5. A. Dickinson, Contemporary Animal Learning Theory. Cambridge, MA: Cambridge Univ. Press, 1980. 6. W. K. Estes, ¡°Toward a statistical theory of learning,¡± Psych. Rev., vol. 57, pp. 94107, 1950. 7. M. L. Tsetlin, Automation Theory and Modeling of Biological Systems. New York: Academic, 1973. 8. T. M. Cover and M. E. Hellman, ¡°The twoarmed bandit problem with timeinvariant finite memory,¡± IEEE Trans. Inform. Theory, vol. IT16, pp. 185195, 1970. 9. K. S. Narendra and M. A. L. Thathachar, Learning Automata: An Introduction. Englewood Cliffs, NJ: Prentice Hall, 1989. 10. A. H. Klopf, The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence. Washington, DC: Hemisphere, 1982. 11. A. G. Barto and R. S. Sutton, ¡°Landmark learning: an illustration of associative search,¡± Biol. Cybern., vol. 42, pp. 18, 1981. 12. A. G . Barto, R. S. Sutton, and C. W. Anderson, ¡°Neuronlike adaptive elements that can solve difficult learning control problems,¡± IEEE Trans. Syst. Man Cybern., vol. SMC13, no. 5, pp. 834847, 1983. 13. A. G . Barto and P. Anandan, ¡°Patternrecognizing stochastic learning automata,¡± IEEE Trans. Syst. Man Cybern., vol. SMC15, no. 3, pp. 360375, 1985. 14. A. G . Barto and M. I. Jordan, ¡°Gradient following without backpropagation in layered networks,¡± in Proc. 1987 Int. Joint Conf Neural Nem., San Diego, CA, vol. 11, pp. 629636. 15. R. J. Williams, ¡°A class of gradientestimating algorithms for reinforcement learning in neural networks,¡± in Proc. 1987 Int. Joint Conf Neural Nem., San Diego, CA, vol. 11, pp. 601608. 16. C. C. Lee, ¡°Fuzzy logic in control systems: Fuzzy logic controllerParts I and 11,¡± IEEE Trans. Syst. Man Cybern., vol. SMC20, no. 2, pp. 404435, 1990. 17. L. A. Zadeh, ¡°Fuzzy logic,¡± IEEE Compur., pp. 8393, Apr. 1988. 18. J. Hertz, A. Krogh, and R. G . Palmer, Introduction to the Theory of Neural Computation. New York: AddisonWesley, 1991, pp. 188189. 19. J. A. Franklin, ¡°Input space representation for reinforcement learning control,¡± in Proc. IEEE Int. Con8 Intell. Mach., 1989, pp. 115122. 20. C. C. Lee and H. R. Berenji, ¡°An intelligent controller based on approximate reasoning and reinforcement learning,¡¯¡¯ in Proc. IEEE Int. Conf. Intell. Mach., 1989, pp. 20C~205. 21. R. S. Sutton, ¡°Learning to predict by the methods of temporal difference,¡± Mach. Learning, vol. 3, pp. 944, 1988.
J,
[22] 22. P. J. Werbos, ¡°A menu of design for reinforcement learning over time,¡± in Neural Nemorks for Control, W. T. Miller, 111, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA: M.I.T. Press, 1990, ch. 3. [23] 23. B. Widrow and M. E. Hoff, ¡°Adaptive switching circuits,¡± I960 WESCON Convent. Rec., part IV, pp. 96104. [24] 24. R. H. Cannon, Jr., Dynamics of Physical Systems. New York: McGrawHill, 1967. [25] 25. K. C. Cheok and N. K. Loh, ¡°A ballbalancing demonstration of optimal and disturbanceaccommodating control,¡± IEEE Contr. Syst. Mag., pp. 5457, Feb. 1987. [26] 26. B. Widrow, ¡°The original adaptive neural net broombalancer,¡± in Proc. Int. Symp. Circ. and Syst., May 1987, pp. 351357. [27] 27. D. Michie and R. A. Chambers, ¡°BOXES: An experiment in adaptive control,¡± in Much. Intell., E. Dale and D. Michie, Eds. Edinburgh, Scotland: Oliver and Boyd, 1968, vol. 2, pp. 137152. [28] 28. C. W. Anderson, ¡°Strategy learning with multilayer connectionist representations,¡± in Proc. Fourth Int. Workshop on Mach. Learn., Irvine, CA, June 1987, pp. 103114. 1, [29] 29. C. W. Anderson and W. T. Miller 1 1 ¡°Challenging control problems,¡± in Neural Networks for Control, W. T. Miller, III, R. S. Sutton, and P. J. Werbos, Eds. Cambridge, MA: M.I.T. Press, 1990, ch. A. [30] 30. J. S. Albus, ¡°Mechanisms of planning and problem solving in the brain,¡± Math. Biosci., vol. 45, pp. 247293, 1979. 1311 31. D. Michie and R. A. Chambers, ¡®Vhinspace ¡®Boxes¡¯ as a model of patternformation,¡± in Towards a Theoretical Biology, I. Prolegomena and C. H. Waddington, Eds. Edinburgh: Edinburgh Univ. Press, 1968, vol. 1, pp. 206215. [321 32, H, R, Berenji, architecture for designing fuzzy controllers using neural networks,¡± Int. J. Approx. Reason., vol. 6, no. 2, pp. 267292, Feb. 1992. [33] 33. H. R. Berenji and P. Khedkar, ¡°Learning and tuning fuzzy logic controllers through reinforcements,¡± IEEE Trans. Neural Netw., vol. 3, no. 5, pp. 724740, 1992.
ChinTeng Lin received the B.S. degree in control engineering from the National ChiaoTung University, Taiwan, R.O.C., in 1986 and the M.S.E.E. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1989 and 1992, respectively. Since August 1992, he has been with the Department of Computer Information Science, National ChiaoTung University, Hsinchu, Taiwan, R.O.C. His current research interests are neural networks, fuzzy systems, learning systems, intelligent control, and artificial intelligence. Dr. Lin is a member of Tau Beta Pi and Eta Kappa Nu. He is also a member of the IEEE Computer Society, the IEEE Robotics and Automation Society, and the IEEE Systems, Man, and Cybernetics Society.
C. S. George Lee received the B.S. and M.S. degrees in electrical engineering from Washington State University in 1973 and 1974, respectively, and the Ph.D. degree from Purdue University, West Lafayette, IN, in 1978. In 19781979, he taught at Purdue University and, in 19791985, at the University of Michigan. Since 1985, he has been with the School of Electrical Engineering, Purdue University, where he is currently Professor of Electrical Engineering. His current research interests include computational robotics, intelligent robotic assembly systems, and neuralnetworkbased fuzzy logic control systems. He was an IEEE Computer Society Distinguished Visitor in 19831986, the Organizer and Chairman of the 1988 NATO Advanced Research Workshop on SensorBased Robots: Algorithms and Architectures, and the Secretary of the IEEE Robotics and Automation Society in 19881990. Currently, he is VicePresident for Technical Affairs and an AdCom member of the IEEE Robotics and Automation Society, and an Associate Editor of the International Journal of Robotics and Automation. He is a coauthor of Robotics: Control, Sensing, Vision, and Intelligence (McGrawHill), and coeditor of Tutorial on Robotics (IEEE Computer Society Press). Dr. Lee is a member of Sigma Xi and Tau Beta Pi.