 Castanon:1997p23922


Approximate dynamic programming for sensor management
D. Castanon
Decision and Control, 1997. Proceedings of the 36th IEEE Conference on
2
1202-1207, vol. 2
(1997)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=657615&isnumber=14272&punumber=5239&k2dockey=657615@ieeecnfs
http://dx.doi.org/10.1109/CDC.1997.657615
This paper studies the problem of dynamic scheduling of multimode sensor resources for the classification of multiple unknown objects. Because of the uncertain nature of the object types, the problem is formulated as a partially observed Markov decision problem with a large state space. The paper describes a hierarchical algorithmic approach for the efficient solution of sensor scheduling problems with large numbers of objects, based on a combination of stochastic dynamic programming and nondifferentiable optimization techniques. The algorithm is illustrated with an application involving classification of 10,000 unknown objects.
Keywords: application, sensor management, object classification, dynamic scheduling of multimode sensor resources, RL

 SzeFongYau:2000p23917


Performance analysis of the approximate dynamic programming algorithm for parameter estimation of superimposed signals
S. F. Yau and Y. Bresler
Signal Processing, IEEE Transactions on
48
1274-1286
(2000)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=839975&isnumber=18164&punumber=78&k2dockey=839975@ieeejrns
http://dx.doi.org/10.1109/78.839975
We consider the classical problem of fitting a model composed of multiple superimposed signals to noisy data using the criteria of maximum likelihood (ML) or subspace fitting, jointly termed generalized subspace fitting (GSF). We analyze a previously proposed approximate dynamic programming algorithm (ADP), which provides a computationally efficient solution to the associated multidimensional multimodal optimization problem. We quantify the error introduced by the approximations in ADP and deviations from the key local interaction signal model (LISMO) modeling assumption in two ways. First, we upper bound the difference between the exact minimum of the GSF criterion and its value at the ADP estimate and compare the ADP with GSF estimates obtained by exhaustive multidimensional search on a fine lattice. Second, motivated by the similar accuracy bounds, we use perturbation analysis to derive approximate expressions for the MSE of the ADP estimates. These various results provide, for the first time, an effective tool to predict the performance of the ADP algorithm for various signal models at nonasymptotic conditions of interest in practical applications. In particular, they demonstrate that for the classical problems of sinusoid retrieval and array processing, ADP performs comparably to exact (but expensive) maximum likelihood (ML) over a wide range of signal-to-noise ratios (SNRs) and is therefore an attractive algorithm.
Keywords: inverse problem, search problem, application, RL, array processing

 Shervais:2003p23875


Intelligent supply chain management using adaptive critic learning
S. Shervais and T. Shannon and G. Lendaris
Systems, Man and Cybernetics, Part A, IEEE Transactions on
33
235-244
(2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1219461&isnumber=27401&punumber=3468&k2dockey=1219461@ieeejrns
http://dx.doi.org/10.1109/TSMCA.2003.809214
A set of neural networks is employed to develop control policies that are better than fixed, theoretically optimal policies, when applied to a combined physical inventory and distribution system in a nonstationary demand environment. Specifically, we show that model-based adaptive critic approximate dynamic programming techniques can be used with systems characterized by discrete-valued states and controls. The control policies embodied by the trained neural networks outperformed the best fixed policies (found by either linear programming or genetic algorithms) in a high-penalty-cost environment with time-varying demand.
Keywords: supply chain management, RL, application

 Bahati:2008p24269


Adapting to Run-Time Changes in Policies Driving Autonomic Management
R. Bahati and M. Bauer
Autonomic and Autonomous Systems, 2008. ICAS 2008. Fourth International Conference on
88-93
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4488327&isnumber=4488306&punumber=4488305&k2dockey=4488327@ieeecnfs
http://dx.doi.org/10.1109/ICAS.2008.47
The use of policies within autonomic computing has received significant interest in the recent past. Policy-driven management offers significant benefit since it makes it more straightforward to define and modify system behavior at runtime, through policy manipulation, rather than through re-engineering. In this paper, we present an adaptive policy-driven autonomic management system which makes use of reinforcement learning methodologies to determine how best to use a set of active policies to meet different performance objectives. The focus, in particular, is on strategies for adapting what has been learned for one set of policy actions to a "similar" set of policies when runtime policy modifications occur. We illustrate the impact of the adaptation strategies on the behavior of a multi-component Web server.
Keywords: autonomic computing, web services, RL, application

 Salkham:2008p23971


A Collaborative Reinforcement Learning Approach to Urban Traffic Control Optimization
A. Salkham and R. Cunningham and A. Garg and V. Cahill
Web Intelligence and Intelligent Agent Technology, 2008 IEEE/WIC/ACM International Conference on
2
560-566
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4740684&isnumber=4740584&punumber=4740404&k2dockey=4740684@ieeecnfs
http://dx.doi.org/10.1109/WIIAT.2008.88
The high growth rate of vehicles per capita now poses a real challenge to efficient Urban Traffic Control (UTC). An efficient solution to UTC must be adaptive in order to deal with the highly dynamic nature of urban traffic. In the near future, global positioning systems and vehicle-to-vehicle/infrastructure communication may provide a more detailed local view of the traffic situation that could be employed for better global UTC optimization. In this paper we describe the design of a next-generation UTC system that exploits such local knowledge about a junction's traffic in order to optimize traffic control. Global UTC optimization is achieved using a local Adaptive Round Robin (ARR) phase switching model optimized using Collaborative Reinforcement Learning (CRL). The design employs an ARR-CRL-based agent controller for each signalized junction that collaborates with neighbouring agents in order to learn appropriate phase timing based on the traffic pattern. We compare our approach to a non-adaptive fixed-time UTC system and to a saturation balancing algorithm in a large-scale simulation of traffic in Dublin's inner city centre. We show that the ARR-CRL approach can provide significant improvement, resulting in up to ~57% lower average waiting time per vehicle compared to the saturation balancing algorithm.
Keywords: application, traffic control, urban traffic, RL, phase-timing control

 Liu:2008p23751


Adaptive Critic Learning Techniques for Engine Torque and Air-Fuel Ratio Control
D. Liu and H. Javaherian and O. Kovalenko and T. Huang
Systems, Man, and Cybernetics, Part B, IEEE Transactions on
38
988-993
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4554033&isnumber=4567535&punumber=3477&k2dockey=4554033@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2008.922019
A new approach for engine calibration and control is proposed. In this paper, we present our research results on the implementation of adaptive critic designs for self-learning control of automotive engines. A class of adaptive critic designs that can be classified as (model-free) action-dependent heuristic dynamic programming is used in this research project. The goals of the present learning control design for automotive engines include improved performance, reduced emissions, and maintained optimum performance under various operating conditions. Using data from a test vehicle with a V8 engine, we developed a neural network model of the engine and neural network controllers based on the idea of approximate dynamic programming to achieve optimal control. We have developed and simulated self-learning neural network controllers for both engine torque (TRQ) and exhaust air-fuel ratio (AFR) control. The goal of TRQ control and AFR control is to track the commanded values. For both control problems, excellent neural network controller transient performance has been achieved.
Keywords: engine torque control, actor-critic, application, control of automotive engines, RL, exhaust air-fuel ratio control

 Jodogne:2006p10601


Approximate Policy Iteration for Closed-Loop Learning of Visual Tasks
S. Jodogne and C. Briquet and J. Piater
ECML 2006
(2006)
http://www.springerlink.com/index/u7541v105r6t7gv0.pdf
Approximate Policy Iteration (API) is a reinforcement learning paradigm that is able to solve high-dimensional, continuous control problems. We propose to exploit API for the closed-loop learning of mappings from images to actions. This approach requires a family of function approximators that map visual percepts to a real-valued function. For this purpose, we use Regression Extra-Trees, a fast, yet accurate and versatile machine learning algorithm. The inputs of the Extra-Trees consist of a set of visual features that digest the informative patterns in the visual signal. We also show how to parallelize the Extra-Trees learning process to further reduce the computational expense, which is often essential in visual tasks. Experimental results on real-world images indicate that the combination of API with Extra-Trees is a promising framework for the interactive learning of visual tasks.
Keywords: visual navigation, RL, application

 Mori:2005p10821


Off-Policy Natural Actor-Critic
T. Mori and Y. Nakamura and S. Ishii
IEICE Technical Report (Institute of Electronics, Information and Communication Engineers)
(2005)
http://isw3.naist.jp/IS/TechReport/report/2005007.ps
Recently developed Natural Actor-Critic (NAC) [1][2], which employs natural policy gradient learning for the actor and LSTD-Q(λ) for the critic, has provided a good model-free reinforcement learning scheme applicable to high-dimensional systems. Since NAC is an on-policy learning method, however, a new sample sequence is required for estimating the sufficient statistics of the policy gradient after the current policy or the policy's parameterization is modified; otherwise the gradient estimate is biased. Moreover, the control of exploration and exploitation must be performed by direct operation on the policy, which may be a strong constraint on introducing an exploratory factor. To overcome these problems, we propose an off-policy NAC in this article, in which the policy gradient is estimated without asymptotic bias by using past system trajectories obtained under the control of past policies. In addition, this method allows exploration to be controlled from outside the policy optimization, which is more effective than direct policy operation. We also propose a hierarchical parameterization of the policy and off-policy NAC learning for that policy. Computer experiments using a snake-like robot simulator show that our new off-policy NAC is so effective that the number of required trajectories is much smaller than with the on-policy method. Moreover, even when the policy parameterization is high-dimensional, our hierarchical off-policy NAC learning successfully obtains a good policy by controlling the policy's effective dimensionality dynamically. These improvements allow more efficient application to high-dimensional control problems.
Keywords: snake-like robot, RL, application

 ElFakdi:2008p24405


Policy gradient based Reinforcement Learning for real autonomous underwater cable tracking
A. El-Fakdi and M. Carreras
Intelligent Robots and Systems, 2008. IROS 2008. IEEE/RSJ International Conference on
3635-3640
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4650873&isnumber=4650570&punumber=4637508&k2dockey=4650873@ieeecnfs
http://dx.doi.org/10.1109/IROS.2008.4650873
This paper proposes a field application of a high-level reinforcement learning (RL) control system for solving the action selection problem of an autonomous robot in a cable tracking task. The learning system is characterized by the use of a direct policy search method for learning the internal state/action mapping. Policy-only algorithms may suffer from long convergence times when dealing with real robotics. In order to speed up the process, the learning phase was carried out in a simulated environment and, in a second step, the policy was transferred to and tested successfully on a real robot. Future work plans to continue the learning process online on the real robot while performing the task. We demonstrate the feasibility of the approach with real experiments on the underwater robot ICTINEU AUV.
Keywords: underwater cable tracking, RL, application

 PaterninaArboleda:2008p24039


Simulation-optimization using a reinforcement learning approach
C. Paternina-Arboleda and J. Montoya-Torres and A. Fabregas-Ariza
Simulation Conference, 2008. WSC 2008. Winter
1376-1383
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4736213&isnumber=4736042&punumber=4712635&k2dockey=4736213@ieeecnfs
http://dx.doi.org/10.1109/WSC.2008.4736213
The global optimization of complex systems such as industrial systems often necessitates the use of computer simulation. In this paper, we suggest the use of reinforcement learning (RL) algorithms and artificial neural networks for the optimization of simulation models. Several types of variables are taken into account in order to find global optimum values. After a first evaluation through mathematical functions with known optima, the benefits of our approach are illustrated through the example of an inventory control problem frequently found in manufacturing systems. Single-item and multi-item inventory cases are considered. The efficiency of the proposed procedure is compared against a commercial tool.
Keywords: inventory control, RL, application

 Kephart:2007p4269


Coordinating multiple autonomic managers to achieve specified power-performance tradeoffs
J. Kephart and H. Chan and R. Das and D. Levine and G. Tesauro and F. Rawson and C. Lefurgy
Proceedings of the Fourth International Conference on Autonomic Computing
24
(2007)
Keywords: power management, RL, application

 Murray:2001p23935


Globally convergent approximate dynamic programming applied to an autolander
J. Murray and C. Cox and R. Saeks and G. Lendaris
American Control Conference, 2001. Proceedings of the 2001
4
2901-2906, vol. 4
(2001)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=946342&isnumber=20472&punumber=7520&k2dockey=946342@ieeecnfs
http://dx.doi.org/10.1109/ACC.2001.946342
A globally convergent nonlinear approximate dynamic programming algorithm is described, and an implementation of the algorithm in the linear case is developed. The resultant linear approximate dynamic programming algorithm is illustrated via the design of an autolander for the NASA X-43 research aircraft, without a priori knowledge of the X-43's flight dynamics.
Keywords: autolander, aircraft control, RL, application

 QinminYang:2008p23987


A Suite of Robust Controllers for the Manipulation of Microscale Objects
Q. Yang and S. Jagannathan
Systems, Man, and Cybernetics, Part B, IEEE Transactions on
38
113-125
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4389965&isnumber=4432675&punumber=3477&k2dockey=4389965@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2007.909943
A suite of novel robust controllers is introduced for the pickup operation of microscale objects in a microelectromechanical system (MEMS). In MEMS, adhesive, surface tension, friction, and van der Waals forces are dominant. Moreover, these forces are typically unknown. The proposed robust controller overcomes the unknown contact dynamics and ensures its performance in the presence of actuator constraints by assuming that the upper bounds on these forces are known. On the other hand, for the robust adaptive critic-based neural network (NN) controller, the unknown dynamic forces are estimated online. It consists of an action NN for compensating the unknown system dynamics and a critic NN for approximating a certain strategic utility function and tuning the action NN weights. By using the Lyapunov approach, the uniform ultimate boundedness of the closed-loop manipulation error is shown for all the controllers for the pickup task. To imitate a practical system, a few system states are considered to be unavailable due to the presence of measurement noise. An output feedback version of the adaptive NN controller is proposed by exploiting the separation principle through a high-gain observer design. The problem of measurement noise is also overcome by constructing a reference system. Simulation results are presented and compared to substantiate the theoretical conclusions.
Keywords: microactuators, RL, micromanipulators, application

 Frampton:2008p24259


Using dialogue acts to learn better repair strategies for spoken dialogue systems
M. Frampton and O. Lemon
Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on
5045-5048
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4518792&isnumber=4517521&punumber=4505270&k2dockey=4518792@ieeecnfs
http://dx.doi.org/10.1109/ICASSP.2008.4518792
Repair or error-recovery strategies are an important design issue in spoken dialogue systems (SDSs): how to conduct the dialogue when there is no progress (e.g. due to repeated ASR errors). Nearly all current SDSs use hand-crafted repair rules, but a more robust approach is to use reinforcement learning (RL) for data-driven dialogue strategy learning. However, as well as usually being tested only in simulation, current RL approaches use small state spaces which do not contain linguistically motivated features such as "dialogue acts" (DAs). We show that a strategy learned with DA features outperforms hand-crafted and slot-status strategies when tested with real users (+9% average task completion, p < 0.05). We then explore how using DAs produces better repair strategies, e.g. focus-switching. We show that DAs are useful in deciding both when to use a repair strategy, and which one to use.
Keywords: spoken dialogue systems, RL, application

 Peroni:2005p23889


Optimal control of a fed-batch bioreactor using simulation-based approximate dynamic programming
C. Peroni and N. Kaisare and J. Lee
Control Systems Technology, IEEE Transactions on
13
786-790
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1501862&isnumber=32226&punumber=87&k2dockey=1501862@ieeejrns
http://dx.doi.org/10.1109/TCST.2005.852105
In this brief, we extend the simulation-based approximate dynamic programming (ADP) method to the optimal feedback control of fed-batch reactors. We consider a free-end problem, wherein the batch time is considered in finding the optimal feeding strategy in addition to the final-time productivity. In ADP, the optimal solution is parameterized in the form of a profit-to-go function. The original definition of profit-to-go is modified to include the decision of batch termination. Simulations from heuristic feeding policies generate the initial profit-to-go versus state data. An artificial neural network then approximates profit-to-go as a function of the process state. Iterations of the Bellman equation are used to improve the profit-to-go function approximator. The profit-to-go function approximator thus obtained is then implemented in an online controller. This method is applied to cloned invertase expression in Saccharomyces cerevisiae in a fed-batch bioreactor.
Keywords: fed-batch reactors, bioreactors, RL, application
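The profit-to-go recursion with a termination option that this abstract describes can be sketched in a simplified tabular form; the state grid, per-step operating cost, and terminal productivity below are illustrative assumptions, and a lookup table stands in for the paper's neural-network approximator:

```python
# Tabular sketch of a profit-to-go iteration with a batch-termination option.
# State k is a point on a coarse process grid; feeding drifts the state upward.
N = 10
terminal_profit = [k / (N - 1) for k in range(N)]  # productivity if we stop at state k
STEP_COST = 0.05                                   # cost of running one more batch step
nxt = [min(k + 1, N - 1) for k in range(N)]        # deterministic drift under feeding

J = [0.0] * N
for _ in range(100):  # Bellman iterations: J(k) = max(stop now, pay cost and continue)
    J_new = [max(terminal_profit[k], -STEP_COST + J[nxt[k]]) for k in range(N)]
    if max(abs(a - b) for a, b in zip(J, J_new)) < 1e-12:
        break
    J = J_new

# Terminate only where stopping beats continuing; here that is the top state.
stop = [terminal_profit[k] >= -STEP_COST + J[nxt[k]] for k in range(N)]
```

With these toy numbers the iteration converges to J(k) = 1 − 0.05·(9 − k), so the policy feeds all the way to the most productive state before terminating; the paper instead fits a neural network to simulated profit-to-go data and improves it by the same Bellman backup.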

 Esfahani:2008p24015


Widest K-Shortest Paths Q-Routing: A New QoS Routing Algorithm in Telecommunication Networks
A. Esfahani and M. Analoui
Computer Science and Software Engineering, 2008 International Conference on
4
1032-1035
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4722795&isnumber=4722542&punumber=4721667&k2dockey=4722795@ieeecnfs
http://dx.doi.org/10.1109/CSSE.2008.1264
Various kinds of sources (such as voice, video, or data) with diverse traffic characteristics and quality of service (QoS) requirements, multiplexed at very high rates, lead to significant traffic problems such as packet losses, transmission delays, and delay variations, caused mainly by congestion in the networks. Predicting these problems in real time is quite difficult, making the effectiveness of "traditional" methodologies based on analytical models questionable. This article proposes and evaluates a QoS routing policy for communication networks with packet topology and irregular traffic, called widest K-shortest paths Q-routing. The technique used for evaluating the reinforcement signals is Q-learning. Compared to standard Q-routing, the exploration of paths is limited to the K best loop-free paths in terms of hop count (the number of routers in a path), leading to a substantial reduction of convergence time. This work is thus concerned with a routing proposal that improves the delay factor and is based on reinforcement learning: we use Q-learning as the reinforcement learning technique and introduce the K-shortest-paths idea into the learning process. The proposed algorithm is applied to two different topologies, and OPNET is used to evaluate its performance under two traffic conditions, namely low load and high load.
Keywords: network routing, RL, application
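Standard Q-routing, which this entry extends, maintains at each node an estimated delivery time per (destination, neighbor) pair and backs it up after every forwarding decision. A minimal sketch of that update follows; the two-node network, delay values, and learning rate are illustrative assumptions, and the paper's contribution of restricting candidates to the K shortest loop-free paths is not shown:

```python
ALPHA = 0.5  # learning rate (illustrative)

# Q[node][(dest, neighbor)] = estimated delivery time from `node` to `dest`
# when the packet is forwarded via `neighbor`.
Q = {
    "A": {("C", "B"): 5.0},
    "B": {("C", "C"): 2.0},
}

def best_estimate(node, dest):
    """Minimum estimated delivery time from `node` to `dest` over its neighbors."""
    return min(v for (d, _), v in Q[node].items() if d == dest)

def q_routing_update(node, dest, neighbor, queue_delay, link_delay):
    """One backup after forwarding a packet node -> neighbor toward dest."""
    target = queue_delay + link_delay + best_estimate(neighbor, dest)
    Q[node][(dest, neighbor)] += ALPHA * (target - Q[node][(dest, neighbor)])

q_routing_update("A", "C", "B", queue_delay=1.0, link_delay=1.0)
# A's estimate via B moves from 5.0 halfway toward 1 + 1 + 2 = 4.0, i.e. to 4.5
```

Limiting `best_estimate` to the K shortest loop-free neighbors, as the paper proposes, shrinks the exploration space and hence the convergence time.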

 Ranjan:2008p24387


Reinforcement learning for dynamic channel allocation in mobile cellular systems
R. Ranjan and A. Phophalia
Recent Advances in Microwave Theory and Applications, 2008. MICROWAVE 2008. International Conference on
924-927
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4763228&isnumber=4762962&punumber=4749420&k2dockey=4763228@ieeecnfs
http://dx.doi.org/10.1109/AMTA.2008.4763228
In cellular communication systems, an important problem is to allocate the communication resource (bandwidth) so as to maximize the service provided to a set of mobile callers whose demand for service changes randomly. This problem is formulated as a dynamic programming problem and we use a reinforcement learning (RL) method to find dynamic channel allocation policies that are better than previous heuristic solutions. The policies obtained perform well for a broad variety of call traffic patterns. The superior performance of the proposed technique in terms of empirical blocking probability is illustrated in simulation examples.
Keywords: application, dynamic channel allocation, mobile radio, RL, cellular radio
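The RL formulation in this abstract can be illustrated with a toy single-cell admission model learned by tabular Q-learning; the channel count, reward values, and call dynamics below are invented for illustration and are far simpler than the paper's multi-cell setting:

```python
import random

# Toy single-cell channel allocation learned by tabular Q-learning.
random.seed(0)
N_CHANNELS = 3
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

# State = number of channels in use; actions: 0 = block the call, 1 = accept it.
Q = [[0.0, 0.0] for _ in range(N_CHANNELS + 1)]

def step(state, action):
    """Toy dynamics: +1 for an accepted call, -1 for accepting with no free channel."""
    if action == 1 and state < N_CHANNELS:
        return state + 1, 1.0
    if action == 1:
        return state, -1.0            # no channel free: dropped call
    return max(state - 1, 0), 0.0     # block; meanwhile one ongoing call ends

state = 0
for _ in range(20000):
    greedy = int(Q[state][1] > Q[state][0])
    action = random.randrange(2) if random.random() < EPS else greedy
    nxt, reward = step(state, action)
    # One-step Q-learning backup.
    Q[state][action] += ALPHA * (reward + GAMMA * max(Q[nxt]) - Q[state][action])
    state = nxt
```

After training, the greedy policy accepts calls while capacity remains and blocks once the cell is full; the paper learns analogous allocation policies, evaluated by empirical blocking probability, for the much richer cellular case.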

 Draper:1996p10670


Modeling object recognition as a Markov decision process
B. A. Draper
ICPR'96
(1996)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=547241
Keywords: vision, rooftop recognition, RL, application

 Padhi:2003p23907


Approximate dynamic programming based optimal neurocontrol synthesis of a chemical reactor process using proper orthogonal decomposition
R. Padhi and S. Balakrishnan
Neural Networks, 2003. Proceedings of the International Joint Conference on
3
1891-1896, vol. 3
(2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1223696&isnumber=27485&punumber=8672&k2dockey=1223696@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2003.1223696
The concept of an approximate dynamic programming and adaptive critic neural network based optimal controller is extended in this study to include systems governed by partial differential equations. An optimal controller is synthesized for a dispersion-type tubular chemical reactor, which is governed by two coupled nonlinear partial differential equations. It consists of three steps: First, empirical basis functions are designed using the 'Proper Orthogonal Decomposition' technique, and a low-order lumped-parameter system representing the infinite-dimensional system is obtained by carrying out a Galerkin projection. Second, the approximate dynamic programming technique is applied in a discrete-time framework, followed by the use of a dual neural network structure called adaptive critics, to obtain optimal neurocontrollers for this system. In this structure, one set of neural networks captures the relationship between the state variables and the control, whereas the other set captures the relationship between the state and the costate variables. Third, the lumped-parameter control is then mapped back to the spatial dimension using the same basis functions to result in a feedback control. Numerical results are presented that illustrate the potential of this approach. It should be noted that the procedure presented in this study can be used in synthesizing optimal controllers for a fairly general class of nonlinear distributed parameter systems.
Keywords: chemical reactor, RL, application

 Wen:2009p23963


A Machine Learning Method for Dynamic Traffic Control and Guidance on Freeway Networks
K. Wen and S. Qu and Y. Zhang
Informatics in Control, Automation and Robotics, 2009. CAR '09. International Asia Conference on
67-71
(2009)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4777196&isnumber=4777174&punumber=4777173&k2dockey=4777196@ieeecnfs
http://dx.doi.org/10.1109/CAR.2009.96
A distributed approach to Reinforcement Learning in tasks of ramp metering and dynamic route guidance is presented. The problem domain, a freeway integration control application, is formulated as a distributed reinforcement learning problem. The DRL .....
Keywords: freeway network, application, traffic control, ramp metering, RL, dynamic guidance

 Tsitsiklis:1999p8185


Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives
J. Tsitsiklis and B. van Roy
Automatic Control, IEEE Transactions on
44
1840-1851
(1999)
Keywords: pricing, Asian option, RL, application

 WeiminLi:2008p23743


Dynamic energy management for hybrid electric vehicle based on approximate dynamic programming
W. Li and G. Xu and Z. Wang and Y. Xu
Intelligent Control and Automation, 2008. WCICA 2008. 7th World Congress on
7864-7869
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4594156&isnumber=4592780&punumber=4577718&k2dockey=4594156@ieeecnfs
http://dx.doi.org/10.1109/WCICA.2008.4594156
In this paper, an approximate dynamic programming (ADP) based strategy for real-time energy control of parallel hybrid electric vehicles (HEVs) is presented. The aim is to develop a fuel-optimal control which does not rely on a priori knowledge of future driving conditions (global optimal control), but only on the current system operation. Approximate dynamic programming is an online learning method, which controls the system while simultaneously learning its characteristics in real time. A suboptimal energy control is then obtained with a proper definition of a cost function to be minimized at each time instant. The cost function includes fuel consumption, emissions, and the deviation of the battery SOC. Our approach guarantees an optimization of vehicle performance and an adaptation to driving conditions. Simulation results over standard driving cycles are presented to demonstrate the effectiveness of the proposed stochastic approach. The obtained ADP control algorithm was found to outperform a traditional rule-based control strategy.
Keywords: fuel-optimal control of automotive engines, parallel hybrid electric vehicles, RL, application

 XinXu:2008p23724


Self-learning path-tracking control of autonomous vehicles using kernel-based approximate dynamic programming
X. Xu and H. Zhang and B. Dai and H. He
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
2182-2189
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4634099&isnumber=4633757&punumber=4625775&k2dockey=4634099@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2008.4634099
With the fast development of robotics and intelligent vehicles, there has been much research work on the modeling and motion control of autonomous vehicles. However, due to model complexity and unknown disturbances from dynamic environments, the motion control of autonomous vehicles is still a difficult problem. In this paper, a novel self-learning path-tracking control method is proposed for a car-like robotic vehicle, where kernel-based approximate dynamic programming (ADP) is used to optimize the controller performance with little prior knowledge of vehicle dynamics. The kernel-based ADP method is a recently developed reinforcement learning algorithm called kernel least-squares policy iteration (KLSPI), which uses kernel methods with automatic feature selection in policy evaluation to get better generalization performance and learning efficiency. By using KLSPI, the lateral control performance of the robotic vehicle can be optimized in a self-learning and data-driven style. Compared with previous learning control methods, the proposed method has advantages in learning efficiency and automatic feature selection. Simulation results show that the proposed method can obtain an optimized path-tracking control policy in only a few iterations, which will be very practical for real applications.
Keywords: mobile robotics, RL, application

 DerongLiu:2005p23877


A self-learning call admission control scheme for CDMA cellular networks
D. Liu and Y. Zhang and H. Zhang
Neural Networks, IEEE Transactions on
16
1219-1228
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1510721&isnumber=32346&punumber=72&k2dockey=1510721@ieeejrns
http://dx.doi.org/10.1109/TNN.2005.853408
In the present paper, a call admission control scheme that can learn from the network environment and user behavior is developed for code division multiple access (CDMA) cellular networks that handle both voice and data services. The idea is built upon a novel learning control architecture with only a single module instead of the two or three modules in adaptive critic designs (ACDs). The use of the adaptive critic approach for call admission control in wireless cellular networks is new. The call admission controller can perform learning in real time as well as in offline environments, and the controller improves its performance as it gains more experience. Another important contribution of the present work is the choice of utility function for the present self-learning control approach, which makes the learning process much more efficient than existing learning control methods. The performance of our algorithm is shown through computer simulation and compared with existing algorithms.
Keywords: CDMA cellular networks, call admission control, RL, application

 Gosavi:2004p8079


A reinforcement learning algorithm based on policy iteration for average reward: Empirical results with yield management and convergence analysis
A. Gosavi
Machine Learning
55
5-29
(2004)
Keywords: airline yield management, RL, application

 Marbach:1998p17964


Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks
P. Marbach and O. Mihatsch and M. Schulte and J. Tsitsiklis
Advances in Neural Information Processing Systems
(1998)
http://www.cs.toronto.edu/~marbach/PUBL/nips97.ps.gz
Keywords: routing, call admission control, RL, application

 Williams:2007p23764


Approximate Dynamic Programming for CommunicationConstrained Sensor Network Management
J. Williams and J. Fisher and A. Willsky
Signal Processing, IEEE Transactions on
55
4300  4311
(2007)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4276992&isnumber=4276952&punumber=78&k2dockey=4276992@ieeejrns
http://dx.doi.org/10.1109/TSP.2007.896099
Resource management in distributed sensor networks is a challenging problem. This can be attributed to the fundamental tradeoff between the value of information contained in a distributed set of measurements versus the energy costs of acquiring measurements, fusing them into the conditional probability density function (pdf) and transmitting the updated conditional pdf. Communications is commonly the highest contributor among these costs, typically by orders of magnitude. Failure to consider this tradeoff can significantly reduce the operational lifetime of a sensor network. While a variety of methods have been proposed that treat a subset of these issues, the approaches are indirect and usually consider at most a single time step. In the context of object tracking with a distributed sensor network, we propose an approximate dynamic programming approach that integrates the value of information and the cost of transmitting data over a rolling time horizon. We formulate this tradeoff as a dynamic program and use an approximation based on a linearization of the sensor model about a nominal trajectory to simultaneously find a tractable solution to the leader node selection problem and the sensor subset selection problem. Simulation results demonstrate that the resulting algorithm can provide similar estimation performance to that of the common most informative sensor selection method for a fraction of the communication cost.
Keywords: resource management, sensor network, RL, application

 JiaMa:2008p23765


Dual Heuristic Programming Based Neurocontroller for Vibration Isolation Control
J. Ma and T. Yang and Z. Hou and M. Tan and D. Liu
Networking, Sensing and Control, 2008. ICNSC 2008. IEEE International Conference on
874  879
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4525339&isnumber=4525150&punumber=4489617&k2dockey=4525339@ieeecnfs
http://dx.doi.org/10.1109/ICNSC.2008.4525339
The dual heuristic programming (DHP) approach has a superior capability for solving approximate dynamic programming problems within the family of adaptive critic designs (ACDs). In this paper, a DHP-based controller is developed for vibration isolation applications. In the specific case of an active-passive isolator, a multilayer feedforward neural network is pretrained as a differentiable model of the plant for the adaptation of the critic and action networks. In addition, to avoid plunging into local minima during training, pseudorandom signals representing the vibrations of the base are applied to the vibration isolation system. This technique greatly improves the robustness of the DHP controller against unknown disturbances. Moreover, the "shadow critic" training strategy is adopted to improve the convergence rate of training. Simulation results show that the developed DHP controller alleviates vibration disturbances more effectively and has better control performance than the passive isolator. Additionally, compared with a model network adapted online, the pretrained model network makes training more efficient. This further demonstrates the effectiveness of the DHP controller designed with the pretrained model network, even in the presence of unmodeled uncertainties.
Keywords: vibration isolation, RL, application

 WeiSongLin:2008p23733


Adaptive critic motion controller based on sparse radial basis function network
W. Lin and C. Tu
Automation Congress, 2008. WAC 2008. World
1  9
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4698998&isnumber=4698939&punumber=4681917&k2dockey=4698998@ieeecnfs
Motion controllers capable of incremental learning and optimization can automatically tune their parameters to pursue optimal control. By implementing reinforcement learning and approximate dynamic programming, an adaptive critic motion controller is shown to achieve this objective. The control policy and the adaptive critic are implemented by sparse radial basis function networks, and the updating rules for both are derived. The ability and performance of the adaptive critic motion controller are demonstrated by controlling a rotary inverted pendulum system.
Keywords: rotary inverted pendulum, RL, application

 Rashid:2008p24379


Investigating a Dynamic Loop Scheduling with Reinforcement Learning Approach to Load Balancing in Scientific Applications
M. Rashid and I. Banicescu and R. Cario
Parallel and Distributed Computing, 2008. ISPDC '08. International Symposium on
123  130
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4724238&isnumber=4724214&punumber=4724213&k2dockey=4724238@ieeecnfs
http://dx.doi.org/10.1109/ISPDC.2008.25
The advantages of integrating reinforcement learning (RL) techniques into scientific parallel time-stepping applications have been revealed in research work over the past few years. The object of the integration was to automatically select the most appropriate dynamic loop scheduling (DLS) algorithm from a set of available algorithms, with the purpose of improving application performance via load balancing during execution. This paper investigates the performance of such a dynamic loop scheduling with reinforcement learning (DLS-with-RL) approach to load balancing. DLS-with-RL is most suitable for time-stepping scientific applications with a large number of steps. The RL agent's characteristics depend on a learning-rate parameter and a discount-factor parameter. An application simulating wave-packet dynamics that incorporates a DLS-with-RL approach is executed on a cluster of workstations to investigate the influence of these parameters. The RL agent implemented two RL algorithms: QLEARN and SARSA learning. Preliminary results indicate that on a fixed number of processors, the simulation completion time is not sensitive to the values of the learning parameters used in the experiments. The results also indicate that for this application there is no advantage in choosing one RL technique over the other, even though the techniques differed significantly in how often they selected the various DLS algorithms.
Keywords: scientific parallel timestepping application, resource allocation, application, processor scheduling, RL
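The QLEARN and SARSA variants compared in this entry differ only in the bootstrap term of their temporal-difference update; a minimal sketch (the state and action names below are invented for illustration, not taken from the paper):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: bootstrap on the greedy action in the next state.
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap on the action the policy actually took next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Tiny demo: reward a hypothetical DLS choice at one time step.
Q = defaultdict(float)
q_learning_update(Q, 'step_3', 'factoring', 1.0, 'step_4',
                  actions=['factoring', 'fixed_chunk'])
```

When exploration is limited and rewards are similar across actions, the two updates behave almost identically, which is one plausible reading of the paper's observation that neither technique had an advantage here.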

 Papadaki:2005p23883


Resource management in CDMA networks based on approximate dynamic programming
K. Papadaki and V. Friderikos
Local and Metropolitan Area Networks, 2005. LANMAN 2005. The 14th IEEE Workshop on
6 pp.
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1541536&isnumber=32915&punumber=10349&k2dockey=1541536@ieeecnfs
http://dx.doi.org/10.1109/LANMAN.2005.1541536
In this paper a power and rate control scheme for downlink packet transmission in CDMA networks is proposed. Under the assumption of stochastic packet arrivals and channel states, the base station transmits to multiple mobile users at any time instant within rate and power capacity constraints. The objective is to maximize system throughput while taking into account the queue length distribution over a time horizon. We are interested in optimal rate allocation policies over time and thus formulate the problem as a discrete stochastic dynamic program. This dynamic program (DP) is exponentially complex in the number of users, which renders it impractical, so we use an approximate dynamic programming algorithm to obtain suboptimal rate allocation policies in real time. Numerical results reveal that the proposed algorithm improves performance (in terms of a number of measured parameters, such as average queue size) by a factor of at least 3.5 compared to a number of baseline greedy heuristics.
Keywords: downlink packet transmission, application, CDMA networks, RL, rate control, power

 Lu:2008p23750


Direct Heuristic Dynamic Programming for Damping Oscillations in a Large Power System
C. Lu and J. Si and X. Xie
Systems, Man, and Cybernetics, Part B, IEEE Transactions on
38
1008  1013
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4554034&isnumber=4567535&punumber=3477&k2dockey=4554034@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2008.923157
This paper applies a neural-network-based approximate dynamic programming method, namely, the direct heuristic dynamic programming (direct HDP), to a large power system stability control problem. The direct HDP is a learning- and approximation-based approach to addressing nonlinear coordinated control under uncertainty. One of the major design parameters, the controller learning objective function, is formulated to directly account for network-wide low-frequency oscillation in the presence of nonlinearity, uncertainty, and coupling effects among system components. Results include a novel learning control structure based on the direct HDP with applications to two power system problems. The first case involves static var compensator supplementary damping control, which is used to provide a comprehensive evaluation of the learning control performance. The second case aims at addressing a difficult complex system challenge by providing a new solution to a large interconnected power network oscillation damping control problem that frequently occurs in the China Southern Power Grid.
Keywords: oscillation damping, power system stabilization, application, power network, RL, static var compensator supplementary damping control

 DiGiovanna:2009p24202


Coadaptive Brain-Machine Interface via Reinforcement Learning
J. DiGiovanna and B. Mahmoudi and J. Fortes and J. Principe and J. Sanchez
Biomedical Engineering, IEEE Transactions on
56
54  64
(2009)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4540104&isnumber=4783498&punumber=10&k2dockey=4540104@ieeejrns
http://dx.doi.org/10.1109/TBME.2008.926699
This paper introduces and demonstrates a novel brain-machine interface (BMI) architecture based on the concepts of reinforcement learning (RL), coadaptation, and shaping. RL allows the BMI control algorithm to learn to complete tasks from interactions with the environment rather than from an explicit training signal. Coadaptation enables continuous, synergistic adaptation between the BMI control algorithm and the BMI user working in changing environments. Shaping is designed to reduce the learning curve for BMI users attempting to control a prosthetic. Here, we present the theory and an in vivo experimental paradigm to illustrate how this BMI learns to complete a reaching task with a prosthetic arm in a 3D workspace based on the user's neuronal activity. This semi-supervised learning framework does not require user movements. We quantify BMI performance in closed-loop brain control over six to ten days for three rats as a function of increasing task difficulty. All three subjects coadapted with their BMI control algorithms to control the prosthetic significantly above chance at each level of difficulty.
Keywords: man-machine systems, brain-computer interfaces, RL, application

 Giupponi:2009p23983


Fuzzy Neural Control for Economic-Driven Radio Resource Management in Beyond 3G Networks
L. Giupponi and R. Agustí and J. Pérez-Romero and O. Sallent
Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
39
170  189
(2009)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4787023&isnumber=4787649&punumber=5326&k2dockey=4787023@ieeejrns
http://dx.doi.org/10.1109/TSMCC.2008.2001724
Joint radio resource management (JRRM) is the envisaged process aimed at optimizing the radio resource usage of wireless systems to satisfy the requirements of both network operators and users in the context of future-generation wireless networks. In particular, this paper proposes a two-layered JRRM framework to improve the efficiency of multi-radio and multi-operator cellular scenarios. On the one hand, intra-operator JRRM relies on fuzzy neural mechanisms with economic-driven reinforcement learning techniques to exploit radio resources within a single-operator domain. Microeconomic concepts are included in the proposed approach so that user profile differentiation can be considered when making a JRRM decision. On the other hand, inter-operator JRRM enables subscribers to obtain service through other operators if the home operator network is blocked. Simulation results in a number of different scenarios show that inter-operator agreements established in a cooperative scenario benefit both operators and users, enabling efficient load management and increased operator revenue.
Keywords: joint radio resource management, RL, application

 Usynin:2008p24298


Prognostics-Driven Optimal Control for Equipment Performing in Uncertain Environment
A. Usynin and J. Hines and A. Urmanov
Aerospace Conference, 2008 IEEE
1  9
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4526626&isnumber=4526225&punumber=4505446&k2dockey=4526626@ieeecnfs
http://dx.doi.org/10.1109/AERO.2008.4526626
This paper discusses the problem of optimal control for systems performing in uncertain environments, where little information is available regarding the system dynamics. A reinforcement learning approach is proposed to tackle the problem, along with a particular method for incorporating Prognostics and Health Management (PHM) information derived for the system of interest to improve the reinforcement learning routine. The ideas behind reinforcement-learning-based search for optimal control strategies are outlined, and a numerical example illustrating the benefits of using PHM information is given.
Keywords: reliability, aerospace control, RL, application

 Costa:2008p24329


RL-Based Scheduling Strategies in Actual Grid Environments
B. Costa and I. Dutra and M. Mattoso
Parallel and Distributed Processing with Applications, 2008. ISPA '08. International Symposium on
572  577
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4725196&isnumber=4725093&punumber=4725092&k2dockey=4725196@ieeecnfs
http://dx.doi.org/10.1109/ISPA.2008.119
In this work, we study the behaviour of different resource scheduling strategies when doing job orchestration in grid environments. We empirically demonstrate that scheduling strategies based on reinforcement learning are a good choice to improve the overall performance of grid applications and resource utilization.
Keywords: grid computing, scheduling, RL, application

 Powell:2005p23882


Approximate dynamic programming for high dimensional resource allocation problems
W. Powell and A. George and B. BouzaieneAyari and H. Simao
Neural Networks, 2005. IJCNN '05. Proceedings. 2005 IEEE International Joint Conference on
5
2989  2994 vol. 5
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1556401&isnumber=33093&punumber=10421&k2dockey=1556401@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2005.1556401
There is a wide array of discrete resource allocation problems (buffers in manufacturing, complex equipment in electric power, aircraft and locomotives in transportation) that need to be solved over time, under uncertainty. These can be formulated as dynamic programs, but typically exhibit high-dimensional state, action, and outcome variables (the three curses of dimensionality). For example, we have worked on problems where the dimensionality of these variables is in the ten thousand to one million range. We describe an approximation methodology for this problem class, summarize the problem classes where the approach seems to be working well, and outline research challenges that we continue to face.
Keywords: resource allocation, overview, RL, application

 Powell:2007p23827


The optimizing-simulator: Merging simulation and optimization using approximate dynamic programming
W. Powell
Simulation Conference, 2007 Winter
43  53
(2007)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4419587&isnumber=4419576&punumber=4419575&k2dockey=4419587@ieeecnfs
http://dx.doi.org/10.1109/WSC.2007.4419587
There is a wide range of simulation problems that involve making decisions during the simulation, where we would like to make the best decisions possible, taking into account not only what we know when we make the decision but also the impact of the decision on the future. Such problems can be formulated as dynamic programs, stochastic programs, and optimal control problems, but these techniques rarely produce computationally tractable algorithms. We demonstrate how the framework of approximate dynamic programming can produce near-optimal (in some cases) or at least high-quality solutions using techniques that are very familiar to the simulation community. The price of this capability is that the simulation has to be run iteratively, using statistical learning techniques to produce the desired intelligence. The benefit is a reduced dependence on more traditional rule-based logic.
Keywords: overview, RL, application

 Jasmin:2008p23950


A Reinforcement Learning algorithm to economic dispatch considering transmission losses
E. Jasmin and T. I. Ahamed and V. Jagathiraj
TENCON 2008. IEEE Region 10 Conference
1  6
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4766652&isnumber=4766377&punumber=4753725&k2dockey=4766652@ieeecnfs
http://dx.doi.org/10.1109/TENCON.2008.4766652
Reinforcement Learning (RL) refers to a class of learning algorithms in which a learning system learns which action to take in different situations using a scalar evaluation received from the environment after performing an action. RL has been successfully applied to many multistage decision-making problems (MDPs) in which, at each stage, the learning system decides which action to take. The Economic Dispatch (ED) problem is an important scheduling problem in power systems that decides the amount of generation to allocate to each generating unit so that the total cost of generation is minimized without violating system constraints. In this paper we formulate the economic dispatch problem as a multistage decision-making problem and develop an RL-based algorithm to solve it. The performance of our algorithm is compared with other recent methods. The main advantage of our method is that it can learn the schedule for all possible demands simultaneously.
Keywords: economic dispatch, application, power generation scheduling, RL, power generation dispatch
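For context, when generator costs are quadratic and unit limits and transmission losses are ignored, the ED problem described above has a classical closed-form solution via the equal incremental cost condition; a hypothetical sketch (the cost coefficients are invented), useful as a baseline against which a learned schedule could be checked:

```python
def dispatch(a, b, demand):
    """Allocate `demand` across units with cost c_i(P) = a_i*P^2 + b_i*P.

    At the optimum every unit runs at the same marginal cost lam:
        2*a_i*P_i + b_i = lam  =>  P_i = (lam - b_i) / (2*a_i),
    and lam is fixed by the power balance sum_i P_i = demand.
    """
    s1 = sum(1.0 / (2.0 * ai) for ai in a)
    s2 = sum(bi / (2.0 * ai) for ai, bi in zip(a, b))
    lam = (demand + s2) / s1
    return [(lam - bi) / (2.0 * ai) for ai, bi in zip(a, b)]
```

For two identical units (a=[0.5, 0.5], b=[1.0, 1.0]) and a demand of 10, the schedule is an even split of 5 each.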

 Chang:2007p16093


An asymptotically efficient simulation-based algorithm for finite horizon stochastic dynamic programming
H. Chang and M. Fu and J. Hu and S. Marcus
IEEE Trans. on Automatic Control
(2007)
http://www.isr.umd.edu/~Marcus/docs/samw_final.pdf
Keywords: finite-horizon inventory control problem, RL, application

 JunMoKang:1999p23919


Approximate dynamic programming solutions for lean burn engine aftertreatment
J. Kang and I. Kolmanovsky and J. Grizzle
Decision and Control, 1999. Proceedings of the 38th IEEE Conference on
2
1703  1708 vol.2
(1999)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=830269&isnumber=18006&punumber=6713&k2dockey=830269@ieeecnfs
http://dx.doi.org/10.1109/CDC.1999.830269
The competition to deliver fuel-efficient and environmentally friendly vehicles is driving the automotive industry to consider ever more complex powertrain systems. Adequate performance of these new, highly interactive systems can no longer be obtained through traditional approaches, which are intensive in hardware use and final control software calibration. The paper explores the use of dynamic programming to make model-based design decisions for a lean burn, direct injection spark ignition engine, in combination with a three-way catalyst and lean NOx trap aftertreatment system. The primary contribution is the development of a very rapid method to evaluate the tradeoffs in fuel economy and emissions for this novel powertrain system, as a function of design parameters and controller structure, over a standard emission test cycle.
Keywords: fuel economy, engine control, RL, application

 Kovalenko:2004p23888


Neural network modeling and adaptive critic control of automotive fuelinjection systems
O. Kovalenko and D. Liu and H. Javaherian
Intelligent Control, 2004. Proceedings of the 2004 IEEE International Symposium on
368  373
(2004)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1387711&isnumber=30194&punumber=9528&k2dockey=1387711@ieeecnfs
http://dx.doi.org/10.1109/ISIC.2004.1387711
We investigate the applications of a class of adaptive critic designs that can be classified as (model-free) action-dependent heuristic dynamic programming (ADHDP). Adaptive critic designs are defined as designs that approximate dynamic programming in the general case, i.e., approximate optimal control over time in a noisy, nonlinear environment. The goals of the present learning control design for automotive engines include improved performance, reduced emissions, and maintained optimum performance under various operating conditions. Using data obtained from a test vehicle, we first develop a neural network model of the engine. The neural network controller is then designed based on the idea of approximate dynamic programming to achieve optimal control. In the simulation studies, the initial controller is trained using the neural network engine model rather than the actual engine. We have developed and tested self-learning neural network controllers for both engine torque and exhaust air-fuel ratio control. The goal of the engine torque control is to track the commanded torque; the objective of the air-fuel ratio control is to regulate the engine air-fuel ratio at specified set points. For both control problems, good transient performance of the neural network controller has been observed. A distinct feature of the present technique is the controller's real-time adaptation capability, which allows the neural network controller to be further refined and improved during real-time vehicle operation through continuous learning and adaptation.
Keywords: automotive fuel injection, RL, application

 Ernst:2005p4189


Approximate Value Iteration in the Reinforcement Learning Context. Application to Electrical Power System Control
D. Ernst and M. Glavic and P. Geurts and L. Wehenkel
International Journal of Emerging Electric Power Systems
(2005)
http://www.bepress.com/ijeeps/vol3/iss1/art1066
Keywords: Electrical Power System Control, RL, application

 Das:1999p10438


Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning
T. Das and A. Gosavi and S. Mahadevan and N. Marchalleck
MANAGEMENT SCIENCE
(1999)
http://www.cs.umass.edu/~mahadeva/papers/managementsci.ps.gz
Keywords: optimal preventive maintenance schedule, production inventory system, RL, application

 Koole:2005p23872


Approximate dynamic programming in multiskill call centers
G. Koole and A. Pot
Winter Simulation Conference, 2005 Proceedings of the
8
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1574297&isnumber=33294&punumber=10515&k2dockey=1574297@ieeecnfs
http://dx.doi.org/10.1109/WSC.2005.1574297
We consider a multiskill call center consisting of specialists and fully cross-trained agents. All traffic is inbound and there is a waiting queue for each skill type. Our objective is to obtain good call routing policies. In this paper we use the so-called policy iteration (PI) method, applied in the context of approximate dynamic programming (ADP). The standard PI method requires the exact value function, which is well known from dynamic programming. We remark that standard methods to obtain the value function suffer from the curse of dimensionality, i.e., the number of states grows exponentially with the size of the call center. Therefore, we replace the real value function by an approximation, using techniques from ADP: the value function is approximated by simulating the system. We apply this method to our call routing problem, yielding very good results.
Keywords: call routing, RL, application

 Valasek:2008p23972


Improved Adaptive-Reinforcement Learning Control for Morphing Unmanned Air Vehicles
J. Valasek and J. Doebbler and M. Tandale and A. Meade
Systems, Man, and Cybernetics, Part B, IEEE Transactions on
38
1014  1020
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4554213&isnumber=4567535&punumber=3477&k2dockey=4554213@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2008.922018
This paper presents an improved Adaptive-Reinforcement Learning Control methodology for the problem of unmanned air vehicle morphing control. The reinforcement learning morphing control function that learns the optimal shape-change policy is integrated with an adaptive dynamic inversion trajectory-tracking control function. An episodic unsupervised learning simulation using the Q-learning method is developed to replace an earlier and less accurate Actor-Critic algorithm. Sequential Function Approximation, a Galerkin-based scattered-data approximation scheme, replaces a K-Nearest Neighbors (KNN) method and is used to generalize the learning from previously experienced quantized states and actions to the continuous state-action space, all of which may not have been experienced before. The improved method showed smaller errors and improved learning of the optimal shape compared to the KNN.
Keywords: morphing control, unmanned aircraft control, RL, application

 Hu:2008p24295


QELAR: A Q-learning-based Energy-Efficient and Lifetime-Aware Routing Protocol for Underwater Sensor Networks
T. Hu and Y. Fei
Performance, Computing and Communications Conference, 2008. IPCCC 2008. IEEE International
247  255
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4745119&isnumber=4745069&punumber=4712735&k2dockey=4745119@ieeecnfs
http://dx.doi.org/10.1109/PCCC.2008.4745119
Underwater sensor networks (UWSNs) have emerged as a promising network technique for various aquatic applications in recent years. Due to constraints in UWSNs such as high latency, low bandwidth, and high energy consumption, it is challenging to build networking protocols for UWSNs. In this paper, we focus on the routing issue in UWSNs and propose QELAR, an adaptive, energy-efficient, and lifetime-aware routing protocol based on reinforcement learning. Our protocol assumes generic MAC protocols and aims at prolonging the lifetime of networks by making the residual energy of sensor nodes more evenly distributed. The residual energy of each node, as well as the energy distribution within a group, is factored into the reward function throughout the routing process, which aids in selecting adequate forwarders for packets. We have performed extensive simulations of the proposed protocol on the Aquasim platform and compared it with an existing routing protocol (VBF) in terms of packet delivery rate, energy efficiency, latency, and lifetime. The results show that the QELAR protocol yields a 20% longer lifetime on average than VBF.
Keywords: routing, underwater sensor network, RL, application
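A reward that folds each candidate forwarder's residual energy and the group-level energy balance into the routing decision, in the spirit of what the abstract describes, might be sketched as follows; the weights and functional form here are illustrative assumptions, not QELAR's actual equations:

```python
def reward(residual, group_mean, tx_cost, w_energy=1.0, w_balance=1.0):
    # Favor nodes with high residual energy, penalize nodes whose energy
    # deviates from the group average (keeps drain evenly distributed),
    # and subtract the transmission cost of the hop.
    return (w_energy * residual
            - w_balance * abs(residual - group_mean)
            - tx_cost)

def pick_forwarder(neighbors):
    # neighbors: iterable of (node_id, residual_energy, group_mean, tx_cost)
    return max(neighbors, key=lambda n: reward(n[1], n[2], n[3]))[0]
```

In a full Q-learning formulation this reward would feed a value update per state rather than a one-shot greedy choice; the sketch only shows how the energy terms can shape forwarder selection.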

 Powell:2003p4182


Approximate Dynamic Programming for Asset Management: Solving the curses of dimensionality
W. B. Powell
233
(2003)
Keywords: asset management, RL, application

 JongMinLee:2004p23886


Empirical model-based control of nonlinear processes using approximate dynamic programming
J. M. Lee and J. Lee
American Control Conference, 2004. Proceedings of the 2004
4
3041  3046 vol.4
(2004)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1384375&isnumber=30159&punumber=9503&k2dockey=1384375@ieeecnfs
A major difficulty associated with using an empirical nonlinear model for model-based control is that the model can be unduly extrapolated into regions of the state space where identification data were scarce or even nonexistent. Optimal control solutions obtained with such overextrapolations can result in performance far worse than predicted by the model. In the multistep predictive control setting, it is not straightforward to prevent such overuse of the model by forcing the optimizer to find a solution within the "trusted" regions of the state space. Given this difficulty, we propose an approximate dynamic programming based approach for designing a model-based controller that avoids such abuse of an empirical model with respect to the distribution of the identification data. The approach starts with closed-loop test data obtained with suboptimal controllers, e.g., PI controllers, and attempts to derive a new control policy that improves upon their performance. Iterative improvement based on successive closed-loop testing is possible. A diabatic CSTR example is provided to illustrate the proposed approach.
Keywords: diabatic CSTR, tank reactor, application, RL, bioreactor control

 HongbingWang:2008p23948


RLPLA: A Reinforcement Learning Algorithm of Web Service Composition with Preference Consideration
H. Wang and P. Tang and P. Hung
Congress on Services Part II, 2008. SERVICES2. IEEE
163  170
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4700514&isnumber=4700469&punumber=4700468&k2dockey=4700514@ieeecnfs
http://dx.doi.org/10.1109/SERVICES2.2008.15
There are many static and dynamic Web service composition strategies; however, literature on automatic composition is rare. In this article, a new algorithm based on reinforcement learning is proposed to realize Web service composition automatically. Existing composition prototype systems mainly focus on function-oriented composition rather than QoS-oriented composition. After handling function-oriented composition with reinforcement learning, this paper introduces preference logic to seek a QoS optimization solution, which is a qualitative solution with many advantages over quantitative ones. The result is RLPLA, a novel algorithm for Web service composition based on reinforcement learning and preference logic.
Keywords: web services, quality of service, RL, application

 Nedich:2005p23879


Farsighted sensor management strategies for move/stop tracking
A. Nedich and M. Schneider and R. Washburn
Information Fusion, 2005 8th International Conference on
1
8
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1591905&isnumber=33512&punumber=10604&k2dockey=1591905@ieeecnfs
http://dx.doi.org/10.1109/ICIF.2005.1591905
We consider the sensor management problem arising in using a multimode sensor to track moving and stopped targets. The sensor management problem is to determine what measurements to take over time so as to optimize the utility of the collected data. Finding the best sequence of measurements is a hard combinatorial problem due to many factors, including the large number of possible sensor actions and the complexity of the dynamics. The complexity of the dynamics is due in part to the sensor dwell time depending on the sensor mode, targets randomly starting and stopping, and the uncertainty in the sensor detection process. For such a sensor management problem, we propose a novel, computationally efficient, farsighted algorithm based on an approximate dynamic programming methodology. The algorithm's complexity is polynomial in the number of targets. We evaluate this algorithm against a myopic algorithm optimizing an information-theoretic scoring criterion. Our simulation results indicate that the farsighted algorithm performs better with respect to the average time the track error is below a specified goal value.
Keywords: tracking of moving, application, sensor management, RL, multimode sensors, stopped targets

 Peters:2006p10138


Policy gradient methods for robotics
J. Peters and S. Schaal
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
(2006)
http://wwwclmc.usc.edu/publications/P/petersIROS2006.pdf
Keywords: anthropomorphic arm, hitting a baseball, RL, application

 Abbeel:2006p16213


Using inaccurate models in reinforcement learning
P. Abbeel and M. Quigley and A. Ng
Proceedings of the 23rd international conference on Machine …
(2006)
http://portal.acm.org/citation.cfm?id=1143844.1143845
In the model-based policy search approach to reinforcement learning (RL), policies are found using a model (or "simulator") of the Markov decision process. However, for high-dimensional continuous-state tasks, it can be extremely difficult to build an accurate model, and thus the algorithm often returns a policy that works in simulation but not in real life. The other extreme, model-free RL, tends to require infeasibly large numbers of real-life trials. In this paper, we present a hybrid algorithm that requires only an approximate model, and only a small number of real-life trials. The key idea is to successively "ground" the policy evaluations using real-life trials, but to rely on the approximate model to suggest local changes. Our theoretical results show that this algorithm achieves near-optimal performance in the real system, even when the model is only approximate. Empirical results also demonstrate that, when given only a crude model and a small number of real-life trials, our algorithm can obtain near-optimal performance in the real system.
Keywords: DDP, application, flight simulator, RL, RC Car
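The grounding idea can be illustrated on a toy one-dimensional linear system (a sketch under assumed dynamics and a finite-difference gradient, not the authors' algorithm): gradients are taken on an inaccurate model, but each candidate policy update is kept only if a real-life rollout confirms the improvement.

```python
def real_step(x, a):
    return x + 1.0 * a      # true dynamics (gain 1.0, unknown to the learner)

def model_step(x, a):
    return x + 0.8 * a      # approximate model with a wrong gain

def rollout(step_fn, theta, x0=1.0, horizon=20):
    """Total quadratic cost of the linear policy a = -theta * x."""
    x, cost = x0, 0.0
    for _ in range(horizon):
        x = step_fn(x, -theta * x)
        cost += x * x
    return cost

def hybrid_improve(theta=0.2, iters=30, eps=1e-3, lr=0.5):
    for _ in range(iters):
        # local gradient estimate computed on the (inaccurate) model only
        g = (rollout(model_step, theta + eps)
             - rollout(model_step, theta - eps)) / (2 * eps)
        candidate = theta - lr * g
        # "ground" the step with a real-life evaluation: keep it only if it helps
        if rollout(real_step, candidate) < rollout(real_step, theta):
            theta = candidate
        else:
            lr *= 0.5           # model disagreed with reality; take smaller steps
    return theta
```

Because the model's optimal gain differs from the real one, pure model-based descent would overshoot; the real-life acceptance test keeps the policy near the true optimum.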

 Borkar:2008p24319


A learning scheme for stationary probabilities of large markov chains with examples
V. Borkar and D. Das and A. Banik and D. Manjunath
Communication, Control, and Computing, 2008 46th Annual Allerton Conference on
1097  1099
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4797682&isnumber=4797526&punumber=4786970&k2dockey=4797682@ieeecnfs
http://dx.doi.org/10.1109/ALLERTON.2008.4797682
We describe a reinforcement learning based scheme to estimate the stationary distribution of subsets of states of large Markov chains. `Split sampling' ensures that the algorithm needs to just encode the state transitions and will not need to know any other property of the Markov chain. (An earlier scheme required knowledge of the column sums of the transition probability matrix.) This algorithm is applied to analyze the stationary distribution of the states of a node in an 802.11 network.
Keywords: communication network, RL, application
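The simulate-and-average principle behind such schemes can be sketched as follows (a minimal illustration on a hypothetical two-state chain; this is not the paper's split-sampling algorithm, only the underlying idea of estimating from observed transitions alone):

```python
import random

def estimate_stationary(sample_next, n_states, n_steps, seed=0):
    """Estimate the stationary distribution of a Markov chain using only
    sampled transitions: a stochastic-approximation average of state-visit
    indicators with step size 1/k (i.e. the empirical visit frequency)."""
    rng = random.Random(seed)
    pi = [1.0 / n_states] * n_states
    s = 0
    for k in range(1, n_steps + 1):
        s = sample_next(rng, s)                 # one observed transition
        step = 1.0 / k
        for i in range(n_states):
            pi[i] += step * ((1.0 if i == s else 0.0) - pi[i])
    return pi

# Hypothetical two-state chain with P = [[0.5, 0.5], [0.2, 0.8]];
# its stationary distribution is (2/7, 5/7).
def sample_next(rng, s):
    if s == 0:
        return 0 if rng.random() < 0.5 else 1
    return 0 if rng.random() < 0.2 else 1

pi = estimate_stationary(sample_next, 2, 200000)
```

Note that the estimator never touches the transition matrix itself, only the sampled next states, which is the property the paper exploits for large chains.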

 Lee:2006p1549


Approximate dynamic programming based approach to process control and scheduling
J. Lee and J. Lee
Computers & Chemical Engineering
30
1603 - 1618
(2006)
http://dx.doi.org/10.1016/j.compchemeng.2006.05.043
Keywords: process control, scheduling, RL, application

 Castaon:2008p23723


Team task allocation and routing in risky environments under human guidance
D. Castaon and D. Ahner
Decision and Control, 2008. CDC 2008. 47th IEEE Conference on
1139  1144
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4739148&isnumber=4738560&punumber=4721212&k2dockey=4739148@ieeecnfs
http://dx.doi.org/10.1109/CDC.2008.4739148
In this paper we design coordination policies for unmanned vehicles that select and perform tasks in uncertain environments where vehicles may fail. We develop algorithms that accept different levels of human guidance, from simple allocation of priorities through the use of task values to more complex task partitioning and load balancing techniques. The goal is to maximize expected value completed under human guidance. We develop alternative algorithms based on approximate dynamic programming versions appropriate for each level of guidance, and compare the resulting performance using simulation results.
Keywords: unmanned vehicle coordination, RL, application

 Achbany:2008p23958


Continually Learning Optimal Allocations of Services to Tasks
Y. Achbany and I. J. Jureta and S. Faulkner and F. Fouss
Services Computing, IEEE Transactions on
1
141  154
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4721292&isnumber=4768635&punumber=4629386&k2dockey=4721292@ieeejrns
http://dx.doi.org/10.1109/TSC.2008.12
Open service-oriented systems which autonomously and continually satisfy users' service requests to optimal levels are an appropriate response to the need for increased automation of information systems. Given a service request, an open service-oriented system interprets the functional and nonfunctional requirements laid out in the request and identifies the optimal selection of services, that is, identifies services whose coordinated execution optimally satisfies the requirements in the request. When selecting services, it is relevant to: (1) revise selections as new services appear and others become unavailable; (2) use multiple criteria, including nonfunctional ones, to choose among competing services; (3) base the comparisons of services on observed, instead of advertised, performance; and (4) allow for uncertainty in the outcome of service executions. To address issues (1)-(4), we propose the Multi-Criteria Randomized Reinforcement Learning (MCRRL) service selection approach. MCRRL learns and revises service selections using a novel multi-criteria-driven (including quality-of-service parameters, deadline, reputation, cost, and preferences) reinforcement learning algorithm, which integrates the exploitation of data about individual services' past performance with optimal, undirected, continual exploration of new selections that involve services whose behavior has not been observed. The experiments indicate the algorithm behaves as expected and outperforms two standard approaches.
Keywords: optimal allocations, service to task mapping, RL, application

 Maes:2007p11103


Sequence Labeling with Reinforcement Learning and Ranking Algorithms
F. Maes and L. Denoyer and P. Gallinari
ECML07
(2007)
http://www.springerlink.com/index/v82h161v6760135j.pdf
Many problems in areas such as Natural Language Processing, Information Retrieval, or Bioinformatics involve the generic task of sequence labeling. In many cases, the aim is to assign a label to each element in a sequence. Until now, this problem has mainly been addressed with Markov models and dynamic programming.
We propose a new approach where the sequence labeling task is seen as a sequential decision process. This method is shown to be very fast with good generalization accuracy. Instead of searching for a globally optimal label sequence, we learn to construct this optimal sequence directly in a greedy fashion. First, we show that sequence labeling can be modelled using Markov Decision Processes, so that several Reinforcement Learning (RL) algorithms can be used for this task. Second, we introduce a new RL algorithm which is based on the ranking of local labeling decisions.
Keywords: Natural Language Chunking, Spanish Named Entity Recognition, application, Handwritten Recognition, RL, ranking, sequence labeling
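The MDP framing of sequence labeling can be sketched as follows (a toy contextual-bandit special case with immediate rewards and an invented a/b labeling rule; not the ranking algorithm of the paper): each position is a decision step whose state is the current token plus the previous label, and learned action scores drive greedy left-to-right decoding.

```python
import random

def greedy_label(seq, Q):
    """Decode left-to-right: state = (token, previous label), action = label."""
    labels, prev = [], None
    for tok in seq:
        a = max((0, 1), key=lambda lab: Q.get((tok, prev, lab), 0.0))
        labels.append(a)
        prev = a
    return labels

def train(episodes=2000, alpha=0.1, seed=0):
    """Learn action scores from immediate rewards on a toy task
    (label 'a' as 0 and 'b' as 1) by uniform random exploration."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        seq = [rng.choice("ab") for _ in range(5)]
        gold = [0 if t == "a" else 1 for t in seq]
        prev = None
        for tok, y in zip(seq, gold):
            a = rng.choice((0, 1))                   # explore
            r = 1.0 if a == y else 0.0               # reward for a correct label
            key = (tok, prev, a)
            Q[key] = Q.get(key, 0.0) + alpha * (r - Q.get(key, 0.0))
            prev = a
        # rewards here are immediate, so no bootstrapping is needed; the full
        # MDP formulation would propagate value across labeling decisions
    return Q

Q = train()
```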

 Lendaris:2003p23880


Controller design via adaptive critic and model reference methods
G. Lendaris and R. Santiago and J. McCarthy and M. Carroll
Neural Networks, 2003. Proceedings of the International Joint Conference on
4
3173  3178 vol.4
(2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1224080&isnumber=27487&punumber=8672&k2dockey=1224080@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2003.1224080
Dynamic programming (DP) is a principled way to design optimal controllers for certain classes of nonlinear systems; unfortunately, DP is computationally very expensive. The reinforcement learning methods known as adaptive critics (AC) provide comput.....
Keywords: aircraft control, control augmentation subsystem, application, hypersonic shaped airplane, RL

 Balakrishna:2008p24316


Estimating Taxi-out times with a reinforcement learning algorithm
P. Balakrishna and R. Ganesan and L. Sherry and B. Levy
Digital Avionics Systems Conference, 2008. DASC 2008. IEEE/AIAA 27th
3.D.3-1 - 3.D.3-12
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4702812&isnumber=4702732&punumber=4694773&k2dockey=4702812@ieeecnfs
http://dx.doi.org/10.1109/DASC.2008.4702812
Flight delays have a significant impact on the nation's economy. Taxi-out delays in particular constitute a significant portion of the block time of a flight. In the future, it can be expected that accurate predictions of 'wheels-off' tim.....
Keywords: aircraft control, policy evaluation, application, taxiout time prediction, RL

 JihWenSheu:2008p23757


Designing Automatic Train Regulation for MRT system by adaptive critic method
J. Sheu and W. Lin
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
406  412
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4633824&isnumber=4633757&punumber=4625775&k2dockey=4633824@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2008.4633824
Because of disturbances in the operating environment of a mass rapid transit (MRT) system, robustness against disturbance and schedule punctuality under control constraints are important issues in designing Automatic Train Regulation (ATR) for an MRT system. In this paper, a study of suitable traffic models for designing the ATR system, and an ATR design based on adaptive critic design (ACD) from approximate dynamic programming, specifically dual heuristic programming (DHP), are presented. Moreover, a method to handle the control constraint, and the application of gain scheduling to deal with the time-variant environment of the MRT system, are addressed as well. For comparison, simulations with real operating-environment data are carried out for ATRs designed by both the adaptive critic method and the LQ optimization method.
Keywords: automatic train regulation, mass rapid transit systems, RL, application

 Wang:2006p10294


Adaptive Routing for Sensor Networks using Reinforcement Learning
P. Wang and T. Wang
Proceedings of the Sixth IEEE International Conference on Computer and Information Technology
(2006)
http://doi.ieeecomputersociety.org/10.1109/CIT.2006.34
Efficient and robust routing is central to wireless sensor networks (WSN), which feature energy-constrained nodes, unreliable links, and frequent topology change. While most existing routing techniques are designed to reduce routing cost by optimizing a single goal, e.g., routing path length, load balance, or retransmission rate, in real scenarios these factors affect routing performance in a complex way, leading to the need for a more sophisticated scheme that makes correct trade-offs.
In this paper, we present a novel routing scheme, AdaR, that adaptively learns an optimal routing strategy with respect to multiple optimization goals. We base our approach on a least-squares reinforcement learning technique, which is both data-efficient and insensitive to the initial setting, and thus ideal for the context of ad hoc sensor networks. Experimental results suggest a significant performance gain over a naive Q-learning based implementation.
Keywords: routing, wireless sensor networks, RL, application

 Frazier:2008p23726


Approximate Dynamic Programming in Knowledge Discovery for Rapid Response
P. Frazier and W. Powell and S. Dayanik and P. Kantor
42nd Hawaii International Conference on System Sciences, 2009. HICSS '09.
1  10
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4755493&isnumber=4755314&punumber=4755313&k2dockey=4755493@ieeecnfs
http://dx.doi.org/10.1109/HICSS.2009.79
One knowledge discovery problem in the rapid response setting is the cost of learning which patterns are indicative of a threat. This typically involves a detailed follow-through, such as review of documents and information by a skilled analyst, or detailed examination of a vehicle at a border crossing point, in deciding which suspicious vehicles require investigation. Assessing various strategies and decision rules means we must compare not only the short-term effectiveness of interrupting a specific traveler, or forwarding a specific document to an analyst, but we must also weigh the potential improvement in our profiles that results even from sending a "false alarm". We show that this problem can be recast as a dynamic programming problem with, unfortunately, a huge state space. Several specific heuristics are introduced to provide improved approximations to the solution. The problems of obtaining real-world data to sharpen the analysis are discussed briefly.
Keywords: threat management, RL, application

 Enns:2003p23903


Helicopter trimming and tracking control using direct neural dynamic programming
R. Enns and J. Si
Neural Networks, IEEE Transactions on
14
929  939
(2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1215408&isnumber=27335&punumber=72&k2dockey=1215408@ieeejrns
http://dx.doi.org/10.1109/TNN.2003.813839
This paper advances a neural-network-based approximate dynamic programming control mechanism that can be applied to complex control problems such as helicopter flight control design. Based on direct neural dynamic programming (DNDP), an approximate dynamic programming methodology, the control system is tailored to learn to maneuver a helicopter. The paper consists of a comprehensive treatise of this DNDP-based tracking control framework and extensive simulation studies for an Apache helicopter. A trim network is developed and seamlessly integrated into the neural dynamic programming (NDP) controller as part of a baseline structure for controlling complex nonlinear systems such as a helicopter. Design robustness is addressed by performing simulations under various disturbance conditions. All designs are tested using FLYRT, a sophisticated industrial-scale nonlinear validated model of the Apache helicopter. This is probably the first time that an approximate dynamic programming methodology has been systematically applied to, and evaluated on, a complex, continuous-state, multiple-input multiple-output nonlinear system with uncertainty. Though illustrated for helicopters, the DNDP control system framework should be applicable to general-purpose tracking control.
Keywords: Apache helicopter, helicopter control, RL, application

 JuJiang:2008p24258


Altitude control of aircraft using coefficient-based policy method
J. Jiang and H. Gong and J. Liu and H. Xu and Y. Chen
Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on
000361  000364
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4564557&isnumber=4564482&punumber=4554522&k2dockey=4564557@ieeecnfs
http://dx.doi.org/10.1109/CCECE.2008.4564557
This paper proposes a coefficient-based policy searching method, the direct policy search (DPS), for searching (learning) and constructing policies for controlling the altitude of an aircraft. DPS is a new and efficient reinforcement learning (RL) strategy combined with genetic algorithms (GAs). Specifically, an optimal policy in DPS consists of a set of coefficients which are learned using GA-based RL (GARL). The proposed method for learning an optimal policy is demonstrated by controlling the complicated altitude system of a Boeing 747 aircraft, whose solution space consists of 20 variables. Simulation results show that this new approach produces performance competitive with traditional algorithms such as the classical state-feedback algorithm and the pure RL algorithm.
Keywords: aircraft control, altitude control, application, RL, Boeing 747

 Peshkin:2007p17962


Reinforcement Learning for Adaptive Routing
L. Peshkin and V. Savova
Arxiv preprint cs
(2007)
http://arxiv.org/abs/cs/0703138
Reinforcement learning means learning a policy, a mapping of observations into actions, based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. We present an application of a gradient ascent algorithm for reinforcement learning to the complex domain of packet routing in network communication, and compare the performance of this algorithm to other routing methods on a benchmark problem.
Keywords: packet routing, RL, application
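The gradient-ascent idea can be sketched on a hypothetical two-route choice (not the paper's network benchmark): the policy is a single logit, and REINFORCE-style updates raise the probability of the lower-latency route.

```python
import math, random

def reinforce_routing(episodes=3000, lr=0.1, seed=0):
    """Policy-gradient ascent (REINFORCE) on a toy routing choice where
    route 0 has lower expected latency than route 1; the learned logit
    of picking route 0 should grow large and positive."""
    rng = random.Random(seed)
    theta = 0.0                                   # logit of choosing route 0
    for _ in range(episodes):
        p0 = 1.0 / (1.0 + math.exp(-theta))
        a = 0 if rng.random() < p0 else 1
        # hypothetical link latencies: route 0 is faster on average
        latency = rng.gauss(1.0, 0.1) if a == 0 else rng.gauss(2.0, 0.1)
        reward = -latency
        grad_log_pi = (1.0 - p0) if a == 0 else -p0
        theta += lr * reward * grad_log_pi        # ascend E[reward]
    return theta
```

Even without a baseline, the expected update is proportional to the latency gap between the routes, so the logit drifts toward the faster one.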

 Armstrong:2008p23943


Reinforcement learning for automated performance tuning: Initial evaluation for sparse matrix format selection
W. Armstrong and A. Rendell
Cluster Computing, 2008 IEEE International Conference on
411  420
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4663802&isnumber=4663724&punumber=4655410&k2dockey=4663802@ieeecnfs
http://dx.doi.org/10.1109/CLUSTR.2008.4663802
The field of reinforcement learning has developed techniques for choosing beneficial actions within a dynamic environment. Such techniques learn from experience and do not require teaching. This paper explores how reinforcement learning techniques might be used to determine efficient storage formats for sparse matrices. Three different storage formats are considered: coordinate, compressed sparse row, and blocked compressed sparse row. Which format performs best depends heavily on the nature of the matrix and the computer system being used. To test the above, a program has been written to generate a series of sparse matrices, where any given matrix performs optimally using one of the three storage types. For each matrix, several sparse matrix-vector products are performed. The goal of the learning agent is to predict the optimal sparse matrix storage format for that matrix. The proposed agent uses five attributes of the sparse matrix: the number of rows, the number of columns, the number of nonzero elements, the standard deviation of nonzeros per row, and the mean number of neighbours. The agent is characterized by two parameters: an exploration rate and a parameter that determines how the state space is partitioned. The ability of the agent to successfully predict the optimal storage format is analyzed for a series of 1,000 automatically generated test matrices.
Keywords: performance tuning of algorithms, RL, application
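A minimal sketch of such an agent (with invented features and a synthetic benchmark in which each matrix class has a planted best format; not the paper's attributes, partitioning, or measurement setup):

```python
import random

FORMATS = ("COO", "CSR", "BCSR")

def featurize(m):
    """Coarse state from matrix attributes (a simplification of the
    paper's five attributes: here just a density bucket and a block flag)."""
    density = m["nnz"] / (m["rows"] * m["cols"])
    return ("dense" if density > 0.01 else "sparse", m["blocked"])

def synthetic_time(m, fmt, rng):
    """Hypothetical benchmark: each matrix class has a planted best format."""
    best = "BCSR" if m["blocked"] else ("CSR" if featurize(m)[0] == "dense" else "COO")
    return (1.0 if fmt == best else 2.0) + rng.random() * 0.1

def train_agent(episodes=3000, eps=0.2, alpha=0.1, seed=0):
    """Epsilon-greedy agent that learns average reward (negative runtime)
    per (state, format) pair from repeated synthetic trials."""
    rng = random.Random(seed)
    Q = {}
    for _ in range(episodes):
        m = {"rows": 1000, "cols": 1000,
             "nnz": rng.choice((500, 50000)), "blocked": rng.random() < 0.5}
        s = featurize(m)
        if rng.random() < eps:
            fmt = rng.choice(FORMATS)
        else:
            fmt = max(FORMATS, key=lambda f: Q.get((s, f), 0.0))
        r = -synthetic_time(m, fmt, rng)          # faster run, higher reward
        Q[(s, fmt)] = Q.get((s, fmt), 0.0) + alpha * (r - Q.get((s, fmt), 0.0))
    return Q

def predict(Q, m):
    s = featurize(m)
    return max(FORMATS, key=lambda f: Q.get((s, f), 0.0))
```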

 Gosavi:2004p446


Reinforcement learning for longrun average cost
A. Gosavi
European Journal of Operational Research
155
654 - 674
(2004)
http://dx.doi.org/10.1016/S03772217(02)008743
Keywords: application, maintenance problem, RL

 Richert:2008p24251


Increasing the Autonomy of Mobile Robots by Online Learning Simultaneously at Different Levels of Abstraction
W. Richert and O. Luke and B. Nordmeyer and B. Kleinjohann
Autonomic and Autonomous Systems, 2008. ICAS 2008. Fourth International Conference on
154  159
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4488338&isnumber=4488306&punumber=4488305&k2dockey=4488338@ieeecnfs
http://dx.doi.org/10.1109/ICAS.2008.26
We present a framework that is able to handle system and environmental changes by learning autonomously at different levels of abstraction. It is able to do so in continuous and noisy environments by 1) an active strategy-learning module that uses reinforcement learning and 2) a dynamically adapting skill module that proactively explores the robot's own action capabilities, thereby providing actions to the strategy module. We present results that show the feasibility of simultaneously learning low-level skills and high-level strategies in order to reach a goal while reacting to disturbances such as hardware damage. Thereby, the robot drastically increases its overall autonomy.
Keywords: hierarchy, mobile robots, RL, application

 Ipek:2008p24343


Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
E. Ipek and O. Mutlu and J. Martinez and R. Caruana
Computer Architecture, 2008. ISCA '08. 35th International Symposium on
39  50
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4556714&isnumber=4556700&punumber=4556699&k2dockey=4556714@ieeecnfs
http://dx.doi.org/10.1109/ISCA.2008.21
Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior. We propose a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL) to overcome these limitations. Our RL-based memory controller observes the system state and estimates the long-term performance impact of each action it can take. In this way, the controller learns to optimize its scheduling policy on the fly to maximize long-term performance. Our results show that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average (up to 33%), and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.
Keywords: scheduling, application, access scheduling, microcontrollers, RL

 ChenglongFu:2008p23952


Gait Synthesis and Sensory Control of Stair Climbing for a Humanoid Robot
C. Fu and K. Chen;
Industrial Electronics, IEEE Transactions on
55
2111  2120
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4475434&isnumber=4505400&punumber=41&k2dockey=4475434@ieeejrns
http://dx.doi.org/10.1109/TIE.2008.921205
Stable and robust walking in various environments is one of the most important abilities for a humanoid robot. This paper addresses walking pattern synthesis and sensory feedback control for humanoid stair climbing. The proposed stair-climbing gait is formulated to satisfy the environmental constraint, the kinematic constraint, and the stability constraint; the selection of the gait parameters is formulated as a constrained nonlinear optimization problem. The sensory feedback controller is phase-dependent and consists of the torso attitude controller, zero-moment-point compensator, and impact reducer. The online learning scheme of the proposed feedback controller is based on a policy gradient reinforcement learning method, and the learned controller is robust against external disturbance. The effectiveness of our proposed method was confirmed by walking experiments on a 32-degree-of-freedom humanoid robot.
Keywords: gait synthesis, application, walking, stair climbing, RL, humanoid robots

 Kocsis:2001p18052


Learning Time Allocation Using Neural Networks
L. Kocsis and J. Uiterwijk and J. van den Herik
LECTURE NOTES IN COMPUTER SCIENCE
(2001)
http://www.springerlink.com/index/0PTV8ATA6EQ35CT6.pdf
The strength of a game-playing program is mainly based on the adequacy of the evaluation function and the efficacy of the search algorithm. This paper investigates how temporal difference learning and genetic algorithms can be used to improve various decisions made during game-tree search. Existing TD algorithms are not directly suitable for learning search decisions; we therefore propose a modified update rule that uses the TD error of the evaluation function to shorten the lag between two rewards. Genetic algorithms can be applied directly to learn search decisions. For our experiments we selected the problem of time allocation from the set of search decisions: on each move the player can decide on a certain search depth, constrained by the amount of time left. As a testing ground, we used the game of Lines of Action, which has roughly the same complexity as Othello. From the results we conclude that both the TD and the genetic approach lead to good results when compared to existing time-allocation techniques. Finally, we briefly discuss the issues that can emerge when the algorithms are applied to more complex search decisions.
Keywords: games, timeallocation, RL, application
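The temporal-difference machinery the paper adapts can be sketched with plain TD(0) on the classic five-state random walk (a generic illustration, not the modified update rule proposed in the paper):

```python
import random

def td0_random_walk(episodes=10000, alpha=0.05, seed=0):
    """TD(0) value estimation on a 5-state random walk: start in the
    middle, move left/right with equal probability, reward +1 for
    exiting right and 0 for exiting left.  The true state values are
    1/6, 2/6, ..., 5/6.  The core update is V(s) += alpha * (r + V(s') - V(s))."""
    rng = random.Random(seed)
    V = [0.5] * 5                       # states 0..4, neutral initialization
    for _ in range(episodes):
        s = 2
        while True:
            s_next = s + (1 if rng.random() < 0.5 else -1)
            if s_next < 0:              # exit left, terminal reward 0
                V[s] += alpha * (0.0 - V[s])
                break
            if s_next > 4:              # exit right, terminal reward 1
                V[s] += alpha * (1.0 - V[s])
                break
            V[s] += alpha * (V[s_next] - V[s])
            s = s_next
    return V

V = td0_random_walk()
```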

 Richter:2009p10231


Traffic Light Scheduling using Policy-Gradient Reinforcement Learning
S. Richter
abotea.rsise.anu.edu.au
()
http://abotea.rsise.anu.edu.au/satelliteeventsicaps07/dc/dc23.pdf
Keywords: traffic light scheduling, RL, application

 Yadav:2007p23747


Robust/Optimal Temperature Profile Control of a High-Speed Aerospace Vehicle Using Neural Networks
V. Yadav and R. Padhi and S. Balakrishnan
Neural Networks, IEEE Transactions on
18
1115  1128
(2007)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4267724&isnumber=4267697&punumber=72&k2dockey=4267724@ieeejrns
http://dx.doi.org/10.1109/TNN.2007.899229
An approximate dynamic programming (ADP)-based suboptimal neurocontroller to obtain the desired temperature for a high-speed aerospace vehicle is synthesized in this paper. A 1-D distributed parameter model of a fin is developed from basic thermal physics principles. "Snapshot" solutions of the dynamics are generated with a simple dynamic-inversion-based feedback controller. Empirical basis functions are designed using the "proper orthogonal decomposition" (POD) technique and the snapshot solutions. A low-order nonlinear lumped parameter system to characterize the infinite-dimensional system is obtained by carrying out a Galerkin projection. An ADP-based neurocontroller with a dual heuristic programming (DHP) formulation is obtained with a single-network adaptive critic (SNAC) controller for this approximate nonlinear model. Actual control in the original domain is calculated with the same POD basis functions through a reverse mapping. A further contribution of this paper is the development of an online robust neurocontroller to account for unmodeled dynamics and parametric uncertainties inherent in such a complex dynamic system. A neural network (NN) weight update rule that guarantees boundedness of the weights and relaxes the need for the persistence of excitation (PE) condition is presented. Simulation studies show that in a fairly extensive but compact domain, any desired temperature profile can be achieved starting from any initial temperature profile. Therefore, the ADP- and NN-based controllers appear to have the potential to become controller synthesis tools for nonlinear distributed parameter systems.
Keywords: aerospace vehicles, temperature profile control, RL, application

 Barty:2005p20124


Temporal difference learning with kernels for pricing american style options
K. Barty and J. Roy and C. Strugarek
Preprint
(2005)
http://www.optimizationonline.org/DB_FILE/2005/05/1133.pdf
We propose in this paper to study the problem of estimating the cost-to-go function for an infinite-horizon discounted Markov chain with possibly continuous state space. For implementation purposes, the state space is typically discretized. As soon as the dimension of the state space becomes large, the computation is no longer practicable, a phenomenon referred to as the curse of dimensionality. The approximation of dynamic programming problems is therefore of major importance.
A powerful method for dynamic programming, often referred to as neuro-dynamic programming, consists in representing the Bellman function as a linear combination of a priori defined functions, called neurons. The choice of the neurons is a delicate operation, since it requires some idea of the optimal solution. Furthermore, in a classical learning algorithm, once the choice of these neurons is made it is no longer modified, although the amount of available information concerning the solution increases along the iterations. In other words, such algorithms are "locked" in the vector subspace generated by these neurons. Consequently, the algorithm is not able to reach the optimal solution if it does not belong to the neurons' vector subspace.
In this article, we propose an alternative approach, very similar to temporal differences, based on functional gradient descent and using an infinite kernel basis. Our algorithm results from the combination of stochastic approximation ideas and nonparametric estimation concepts. Furthermore, our algorithm, though aimed at infinite-dimensional problems, is implementable in practice. We prove the convergence of this algorithm under a few conditions, which are classical in stochastic approximation schemes. We conclude by showing on examples how this algorithm can be used to solve both infinite-horizon discounted Markov chain problems and Bermudan option pricing.
Keywords: pricing, American option, RL, application
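The functional-gradient idea can be sketched as follows (a simplified kernel TD update on an invented toy chain; not the authors' algorithm or its convergence conditions): the value estimate is a growing sum of Gaussian kernels, and each observed transition adds a kernel at the visited state weighted by the TD error.

```python
import math, random

class KernelTD:
    """Value estimate represented as a growing sum of Gaussian kernels,
    updated by a TD-style functional-gradient step."""

    def __init__(self, bandwidth=0.1, alpha=0.1, gamma=0.9):
        self.centers, self.weights = [], []
        self.h, self.alpha, self.gamma = bandwidth, alpha, gamma

    def value(self, x):
        return sum(w * math.exp(-((x - c) / self.h) ** 2)
                   for c, w in zip(self.centers, self.weights))

    def update(self, x, r, x_next):
        delta = r + self.gamma * self.value(x_next) - self.value(x)
        self.centers.append(x)          # functional-gradient step: new kernel at x
        self.weights.append(self.alpha * delta)

# Toy chain: every state x in [0, 1] self-transitions with reward r(x) = x,
# so the true discounted value is V(x) = x / (1 - 0.9) = 10 x.
td = KernelTD()
rng = random.Random(0)
for _ in range(1200):
    x = rng.random()
    td.update(x, x, x)
```

Because the basis grows with the visited states, the representation is never "locked" into a fixed subspace, which is the point the abstract makes against pre-chosen neurons.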

 Vengerov:2002p23920


Adaptive learning scheme for power control in wireless transmitters
D. Vengerov and N. Bambos and H. Berenji
Vehicular Technology Conference, 2002. Proceedings. VTC 2002Fall. 2002 IEEE 56th
4
2390  2394 vol.4
(2002)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1040649&isnumber=22309&punumber=8066&k2dockey=1040649@ieeecnfs
http://dx.doi.org/10.1109/VETECF.2002.1040649
We address the issue of power-controlled shared channel access in future wireless networks supporting packetized data traffic. We formulate this problem using the dynamic programming framework and present a distributed approximate dynamic programming algorithm capable of finding the optimal policy. The main tradeoff facing a transmitter is to balance its current power level with future backlog in the presence of stochastically changing interference. Simulation experiments demonstrate that the algorithm can achieve significant performance gains over the standard power control approach.
Keywords: power control, packet radio networks, RL, application

 Williams:2005p23868


An approximate dynamic programming approach for communication constrained inference
J. Williams and J. Fisher and A. Willsky
Statistical Signal Processing, 2005 IEEE/SP 13th Workshop on
1202  1207
(2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1628778&isnumber=34164&punumber=10843&k2dockey=1628778@ieeecnfs
Resource management in distributed sensor networks is a challenging problem. This can be attributed to the fundamental tradeoff between the value of information contained in a distributed set of measurements versus the energy costs of acquiring the measurements, fusing them into a model of uncertainty, and transmitting the resulting model. Communications is commonly the highest contributor among these costs, typically by orders of magnitude. Failure to consider this tradeoff can significantly reduce the operational lifetime of a sensor network. While a variety of methods have been proposed that treat a subset of these issues, the approaches are indirect and usually consider at most a single time step.
In the context of target tracking with a distributed sensor network, we propose an approximate dynamic programming approach which integrates the value of information and the cost of transmitting data over a rolling time horizon. Specifically, we consider tracking a single target and constrain the problem such that, at any time, a single sensor, referred to as the leader node, is activated to both sense and update the probabilistic model. The issue of selecting which sensor should be the leader at each time is of primary interest, as it directly impacts the tradeoff between the estimation accuracy and the cost of communicating the probabilistic model from the old leader node to the new leader node. We formulate this tradeoff as a dynamic program, and use an approximation based on a linearization of the sensor model about a nominal trajectory to find a tractable solution. Simulation results demonstrate that the resulting algorithm can provide estimation performance similar to that of the common most-informative-sensor selection method at a fraction of the communication energy cost.
Keywords: resource management, distributed sensor networks, RL, application

 Pineau:2003p4682


Policycontingent abstraction for robust robot control
J. Pineau and G. Gordon and S. Thrun
Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico
(2003)
Keywords: mobile robotic assistant, RL, application

 Wang:2008p23759


ADP-Based Optimal Adaptive Waveform Selection in Cognitive Radar
B. Wang and J. Wang and J. Li
Intelligent Information Technology Application Workshops, 2008. IITAW '08. International Symposium on
788-790
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4732055&isnumber=4731858&punumber=4731857&k2dockey=4732055@ieeecnfs
http://dx.doi.org/10.1109/IITA.Workshops.2008.96
The problem of adaptive waveform selection can be viewed as a problem of stochastic dynamic programming. However, the computational cost of solving the optimality equations is very high as a result of the large state variable space, and so far few suitable scheduling algorithms have been proposed. To address this shortcoming, we use approximate dynamic programming (ADP) to solve the adaptive waveform selection problem. The proposed scheduling algorithm minimizes the time taken to detect new targets, detects these targets in accordance with their importance, and minimizes the use of radar resources, while incurring a lower computational cost than traditional algorithms.
Keywords: scheduling, application, RL, cognitive radar, adaptive waveform selection

 Galstyan:2004p17974


Resource allocation in the grid using reinforcement learning
A. Galstyan and K. Czajkowski and K. Lerman
Autonomous Agents and Multiagent Systems
(2004)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1373688
Keywords: resource allocation, grid, RL, application

 Crites:1998p17978


Elevator Group Control Using Multiple Reinforcement Learning Agents
R. Crites and A. Barto
Mach Learn
(1998)
http://www.springerlink.com/index/H023RU5337451W57.pdf
Recent algorithmic and theoretical advances in reinforcement learning (RL) have attracted widespread interest. RL algorithms have appeared that approximate dynamic programming on an incremental basis. They can be trained on the basis of real or simulated experiences, focusing their computation on areas of state space that are actually visited during control, making them computationally tractable on very large problems. If each member of a team of agents employs one of these algorithms, a new collective learning algorithm emerges for the team as a whole. In this paper we demonstrate that such collective RL algorithms can be powerful heuristic methods for addressing large-scale control problems.
Elevator group control serves as our testbed. It is a difficult domain posing a combination of challenges not seen in most multiagent learning research to date. We use a team of RL agents, each of which is responsible for controlling one elevator car. The team receives a global reward signal which appears noisy to each agent due to the effects of the actions of the other agents, the random nature of the arrivals and the incomplete observation of the state. In spite of these complications, we show results that in simulation surpass the best of the heuristic elevator control algorithms of which we are aware. These results demonstrate the power of multiagent RL on a very large-scale stochastic dynamic optimization problem of practical utility.
Keywords: elevator group control, RL, application
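The collective-learning setup the abstract describes can be sketched with independent tabular Q-learners sharing one global reward. The two-agent coordination game below is our own toy, not the elevator simulator; each agent sees the team reward as noisy because it also depends on the other agent's action:

```python
import random

random.seed(0)
N_AGENTS, N_ACTIONS = 2, 2
ALPHA, EPS = 0.1, 0.2                 # learning rate, exploration rate
q = [[0.0] * N_ACTIONS for _ in range(N_AGENTS)]  # one Q-table per agent

def global_reward(actions):
    # The team succeeds only if every agent picks action 1 (coordination game).
    return 1.0 if all(a == 1 for a in actions) else 0.0

for _ in range(5000):
    acts = [random.randrange(N_ACTIONS) if random.random() < EPS
            else max(range(N_ACTIONS), key=lambda a: q[i][a])
            for i in range(N_AGENTS)]
    r = global_reward(acts)
    for i, a in enumerate(acts):      # each agent updates only its own table
        q[i][a] += ALPHA * (r - q[i][a])

greedy = [max(range(N_ACTIONS), key=lambda a: q[i][a]) for i in range(N_AGENTS)]
print(greedy)
```

Despite never observing each other's actions, both learners settle on the coordinated joint action, which is the emergent collective behavior the paper exploits at much larger scale.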

 Shervais:2001p23932


Improving theoretically-optimal and quasi-optimal inventory and transportation policies using adaptive critic based approximate dynamic programming
S. Shervais and T. Shannon
Neural Networks, 2001. Proceedings. IJCNN '01. International Joint Conference on
2
1008-1013 vol. 2
(2001)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=939498&isnumber=20330&punumber=7474&k2dockey=939498@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2001.939498
We demonstrate the possibility of improving on theoretically-optimal fixed policies for control of physical inventory systems in a nonstationary fitness terrain, based on the combined application of evolutionary search and adaptive critic terrain following. We show that adaptive critic based approximate dynamic programming techniques based on plant-controller Jacobians can be used with systems characterized by discrete-valued states and controls. Improvements over the best fixed policies (found using either an LP model or a genetic algorithm) in a high-penalty environment average 83% under conditions of both stationary and nonstationary demand, using real-world data.
Keywords: stock control, RL, application

 Abate:2008p23767


An approximate dynamic programming approach to probabilistic reachability for stochastic hybrid systems
A. Abate and M. Prandini and J. Lygeros and S. Sastry
Decision and Control, 2008. CDC 2008. 47th IEEE Conference on
4018-4023
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4739410&isnumber=4738560&punumber=4721212&k2dockey=4739410@ieeecnfs
http://dx.doi.org/10.1109/CDC.2008.4739410
This paper addresses the computational overhead involved in probabilistic reachability computations for a general class of controlled stochastic hybrid systems. An approximate dynamic programming approach is proposed to mitigate the curse-of-dimensionality issue arising in the solution to the stochastic optimal control reformulation of the probabilistic reachability problem. An algorithm tailored to this problem is introduced and compared with the standard numerical solution to dynamic programming on a benchmark example.
Keywords: probabilistic reachability, hybrid systems, RL, application
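The recursion being approximated can be written down exactly on a toy finite-state model (states, transition kernel, and safe set below are invented; the paper's systems are continuous, which is precisely why ADP is needed there):

```python
SAFE = {0, 1}            # safe set K; state 2 is absorbing and unsafe
ACTIONS = [0, 1]
HORIZON = 3

# P[a][x] = distribution over next states given action a in state x (toy).
P = {
    0: {0: [0.8, 0.1, 0.1], 1: [0.5, 0.4, 0.1], 2: [0.0, 0.0, 1.0]},
    1: {0: [0.6, 0.3, 0.1], 1: [0.2, 0.5, 0.3], 2: [0.0, 0.0, 1.0]},
}

# Backward recursion for safe-set reachability:
#   V_N(x) = 1_K(x),   V_k(x) = 1_K(x) * max_a sum_y P(y|x,a) V_{k+1}(y),
# so V_0(x) is the best achievable probability of staying in K for HORIZON steps.
V = [1.0 if x in SAFE else 0.0 for x in range(3)]
for _ in range(HORIZON):
    V = [
        (max(sum(p * V[y] for y, p in enumerate(P[a][x])) for a in ACTIONS)
         if x in SAFE else 0.0)
        for x in range(3)
    ]
print([round(v, 4) for v in V])
```

Here the best action keeps 0.9 of the probability mass inside the safe set each step, so the three-step safety probability is 0.9³ from either safe state.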

 Pietrabissa:2006p23881


An approximate dynamic programming approach to admission control in WCDMA networks
A. Pietrabissa and D. A. Borza
Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, 2006 IEEE
2087-2092
(2006)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4776962&isnumber=4776587&punumber=4768844&k2dockey=4776962@ieeecnfs
http://dx.doi.org/10.1109/CACSDCCAISIC.2006.4776962
This paper presents a Connection Admission Control (CAC) algorithm for Wideband CDMA (WCDMA) networks based on an Approximate Dynamic Programming (ADP) approach: the network is modeled as a Markov Decision Process (MDP), and ADP algorithms are used to compute the optimal admission policy. Two innovative concepts are introduced: i) by formulating the problem as a linear programming problem, the blocking probabilities of the different classes of service supported by the network are controlled by appropriately modifying the reward function; ii) on the grounds of practical considerations on the CAC problem, a restricted-structure policy is determined, based on a reduction of the action space for low-load conditions. Numerical simulation results show the effectiveness of the approach.
Keywords: admission control, WCDMA networks, RL, application

 Kwok:2004p10267


Reinforcement learning for sensing strategies
C. Kwok and D. Fox
Intelligent Robots and Systems (IROS2004)
(2004)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1389903
Since sensors have limited range and coverage, mobile robots often have to make decisions on where to point their sensors. A good sensing strategy allows a robot to collect information that is useful for its tasks. Most existing solutions to this active sensing problem choose the direction that maximally reduces the uncertainty in a single state variable. In more complex problem domains, however, uncertainties exist in multiple state variables, and they affect the performance of the robot in different ways. The robot thus needs to have more sophisticated sensing strategies in order to decide which uncertainties to reduce, and to make the correct tradeoffs. In this work, we apply a least squares reinforcement learning method to solve this problem. We implemented and tested the learning approach in the RoboCup domain, where the robot attempts to reach a ball and accurately kick it into the goal. We present experimental results that suggest our approach is able to learn highly effective sensing strategies.
Keywords: application, RoboCup, sensing, RL, reach the ball, kick it

 deFarias:2003p4366


The Linear Programming Approach to Approximate Dynamic Programming
D. de Farias and B. Van Roy
Operations Research
51
850-865
(2003)
Keywords: queueing network, RL, application
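The entry carries no abstract, so as orientation only: the approximate LP replaces the value function in the exact LP for a discounted-cost MDP with a linear architecture Φr. A deliberately tiny sketch (two invented states, one basis function, our own numbers) shows the mechanics without an LP solver:

```python
ALPHA = 0.5                           # discount factor (toy)
PHI = [1.0, 2.0]                      # single basis function phi(x), 2 states
# Per state-action pair: (cost g(x, a), distribution over next states).
DATA = {
    (0, 0): (1.0, [0.9, 0.1]),
    (0, 1): (2.0, [0.2, 0.8]),
    (1, 0): (3.0, [0.5, 0.5]),
}

# Exact LP for discounted cost:  max c'J  s.t.  J(x) <= g(x,a) + ALPHA*E[J].
# Substituting J = r * phi with one basis function and positive state weights c,
# it collapses to maximizing the scalar r under upper bounds:
#   r * phi(x) <= g + ALPHA * r * E[phi]   =>   r <= g / denom  when denom > 0.
bounds = []
for (x, _a), (g, p) in DATA.items():
    denom = PHI[x] - ALPHA * sum(pi * f for pi, f in zip(p, PHI))
    assert denom > 0, "closed form assumes every denominator is positive"
    bounds.append(g / denom)

r_star = min(bounds)                  # largest r satisfying every constraint
V = [r_star * f for f in PHI]         # ALP approximation of the value function
print([round(v, 3) for v in V])
```

With more basis functions r becomes a vector and a real LP solver is needed; the paper's contribution is bounding the quality of the resulting Φr* and taming the exponential number of constraints.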

 Schneegass:2008p23946


Uncertainty propagation for quality assurance in Reinforcement Learning
D. Schneegass and S. Udluft and T. Martinetz
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on
2588-2595
(2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4634160&isnumber=4633757&punumber=4625775&k2dockey=4634160@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2008.4634160
In this paper we address the reliability of policies derived by Reinforcement Learning from a limited amount of observations. This can be done in a principled manner by taking into account the derived Q-function's uncertainty, which stems from the uncertainty of the estimators used for the MDP's transition probabilities and the reward function. We apply uncertainty propagation in parallel with the Bellman iteration and obtain confidence intervals for the Q-function. In a second step we modify the Bellman operator so as to obtain a policy guaranteeing the highest minimum performance with a given probability. We demonstrate the functionality of our method on artificial examples and show that, for an important problem class, even an enhancement of the expected performance can be obtained. Finally, we verify this observation on an application to gas turbine control.
Keywords: gas turbine control, RL, application
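The core idea, acting on a lower confidence bound of the estimated value rather than the point estimate, can be shown on a toy two-armed bandit (our own invented example, not the paper's Bellman-iteration machinery):

```python
import math

# Invented reward samples: arm "b" looks better but has only two observations.
samples = {
    "a": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0],  # mean 0.7, n=10
    "b": [1.0, 1.0],                                            # mean 1.0, n=2
}
DELTA = 0.05              # confidence parameter

def lower_bound(xs):
    """Hoeffding lower confidence bound for rewards bounded in [0, 1]."""
    n = len(xs)
    mean = sum(xs) / n
    return mean - math.sqrt(math.log(1.0 / DELTA) / (2.0 * n))

greedy = max(samples, key=lambda k: sum(samples[k]) / len(samples[k]))
safe = max(samples, key=lambda k: lower_bound(samples[k]))
print(greedy, safe)       # greedy trusts the tiny sample; the bound does not
```

The greedy choice picks the barely-sampled arm; the confidence-bound choice does not, which is the "guaranteed minimum performance" behavior the paper builds into the modified Bellman operator.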

 Kalyanakrishnan:2007p10468


Batch reinforcement learning in a complex domain
S. Kalyanakrishnan and P. Stone
Proceedings of the 6th international joint conference on …
(2007)
http://portal.acm.org/citation.cfm?id=1329125.1329241
Temporal difference reinforcement learning algorithms are perfectly suited to autonomous agents because they learn directly from an agent's experience based on sequential actions in the environment. However, their most common algorithmic variants are relatively inefficient in their use of experience data, which in many agent-based settings can be scarce. In particular, they make just one learning "update" for each atomic experience. Batch reinforcement learning algorithms, on the other hand, aim to achieve greater data efficiency by saving experience data and using it in aggregate to make updates to the learned policy. Their success has been demonstrated in the past on simple domains like grid worlds and low-dimensional control applications like pole balancing. In this paper, we compare and contrast batch reinforcement learning algorithms with online algorithms based on their empirical performance in a complex, continuous, noisy, multiagent domain, namely RoboCup soccer Keepaway. We find that the two batch methods we consider, Experience Replay and Fitted Q Iteration, both yield significant gains in sample complexity, while achieving high asymptotic performance.
Keywords: Robocup Keepaway Soccer, RL, application
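Of the two batch methods named in the abstract, Fitted Q Iteration is the easier to sketch: repeatedly refit Q to targets computed from a fixed batch of transitions. Below is a minimal tabular version on an invented three-state chain, where a dict stands in for the regressor a real implementation would fit:

```python
GAMMA = 0.9
TERMINAL = 2
# Fixed batch of experience tuples (s, a, r, s') gathered once, then reused.
batch = [(0, 1, 0.0, 1), (1, 1, 1.0, 2), (0, 0, 0.0, 0), (1, 0, 0.0, 0)]

Q = {(s, a): 0.0 for s, a, _, _ in batch}

for _ in range(50):                   # each sweep refits Q to fresh targets
    targets = {}
    for s, a, r, s2 in batch:
        if s2 == TERMINAL:
            targets[(s, a)] = r       # no bootstrap past a terminal state
        else:
            targets[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in (0, 1))
    Q = targets

print({k: round(v, 3) for k, v in sorted(Q.items())})
```

Unlike online temporal-difference learning, every transition in the batch contributes to every sweep, which is the sample-efficiency gain the paper measures on Keepaway.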
