Reinforcement Learning (4): Actor-Critic Methods
Popularity: 14 · Published: 2023-12-12 01:06:30
![](https://img-blog.csdnimg.cn/20200809150851696.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Main idea:
![](https://img-blog.csdnimg.cn/20200809150958247.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Policy Network (Actor)
![](https://img-blog.csdnimg.cn/20200809151103462.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Value Network (Critic):
![](https://img-blog.csdnimg.cn/20200809151244421.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
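The figures above are not reproduced as text, so the following is only a minimal sketch of how the two networks are usually combined, assuming the common formulation in which the state value is approximated as V(s) = Σ_a π(a|s; θ) · q(s, a; w). The names `softmax`, `state_value`, `theta_s` and `q_s` are hypothetical, and plain lists stand in for the neural networks.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def state_value(theta_s, q_s):
    """V(s) = sum_a pi(a|s; theta) * q(s, a; w) for a single state s."""
    pi = softmax(theta_s)            # actor: probabilities over actions
    return sum(p * q for p, q in zip(pi, q_s))  # critic scores, policy-weighted

# Hypothetical numbers for one state with three actions.
theta_s = [0.0, 0.0, 0.0]   # actor logits (equal logits give a uniform policy)
q_s = [1.0, 2.0, 3.0]       # critic's estimate of each action's value
v = state_value(theta_s, q_s)
```

With a uniform policy the state value is simply the mean of the critic's scores; shifting the logits toward the third action would raise it.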
An intuitive analogy:
![](https://img-blog.csdnimg.cn/20200809151315727.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Train the Neural Networks
![](https://img-blog.csdnimg.cn/2020080915143066.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Detailed steps:
![](https://img-blog.csdnimg.cn/20200809151503702.png)
Update value network q using TD
![](https://img-blog.csdnimg.cn/20200809151551523.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
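As a rough illustration of the TD update, here is a sketch that assumes the usual SARSA-style target r_t + γ · q(s_{t+1}, a_{t+1}; w) and uses a table in place of the value network. The function name `td_update` and the step sizes are assumptions made for this example, not taken from the figure.

```python
def td_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One TD step on a tabular critic q[s][a].

    Returns the TD error delta = q(s, a) - (r + gamma * q(s', a')).
    """
    td_target = r + gamma * q[s_next][a_next]
    delta = q[s][a] - td_target
    # For a table, the gradient of q(s, a; w) with respect to its own entry
    # is 1, so gradient descent on 0.5 * delta**2 reduces to this line.
    q[s][a] -= alpha * delta
    return delta

# Two states, two actions, all estimates initialised to zero.
q = [[0.0, 0.0], [0.0, 0.0]]
delta = td_update(q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Only the entry for the visited pair (s, a) moves, and it moves toward the TD target by a fraction `alpha` of the error.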
Update policy network π using policy gradient
![](https://img-blog.csdnimg.cn/20200809151654597.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
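A sketch of the actor update, assuming a softmax policy over per-state logits and the stochastic policy gradient q(s, a; w) · ∇_θ log π(a|s; θ). For a softmax the gradient of log π(a|s) with respect to logit b is 1[b = a] − π(b|s). The name `policy_gradient_update` and the step size `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_update(theta, s, a, q_sa, beta=0.1):
    """Gradient ascent step: theta += beta * q(s, a) * grad log pi(a|s)."""
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += beta * q_sa * grad_log_pi

# One state, two actions; the critic scored the sampled action 1 as 2.0.
theta = [[0.0, 0.0]]
policy_gradient_update(theta, s=0, a=1, q_sa=2.0)
```

Because the critic's score is positive here, the logit of the sampled action rises and the other falls, so the policy becomes more likely to repeat that action.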
Actor-Critic Method
![](https://img-blog.csdnimg.cn/20200809151800142.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
![](https://img-blog.csdnimg.cn/20200809151812129.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
![](https://img-blog.csdnimg.cn/20200809151858506.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
![](https://img-blog.csdnimg.cn/20200809151932615.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Summary of Algorithm
![](https://img-blog.csdnimg.cn/2020080915214158.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
![](https://img-blog.csdnimg.cn/20200809152516717.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
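Putting the pieces together, the loop below is a sketch of one common presentation of the algorithm, which may not match the figures step for step. It assumes a tabular critic, a softmax actor, and a hypothetical one-state toy environment (`toy_env_step`) in which action 1 pays +1 and action 0 pays -1; all names and hyperparameters are illustrative.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def toy_env_step(s, a):
    """Hypothetical environment: one state; action 1 pays +1, action 0 pays -1."""
    reward = 1.0 if a == 1 else -1.0
    return 0, reward  # always returns to state 0

def train(steps=500, alpha=0.1, beta=0.05, gamma=0.5, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0]]  # actor: one row of logits per state
    q = [[0.0, 0.0]]      # critic: tabular q(s, a)
    s = 0
    for _ in range(steps):
        # Observe s_t and sample a_t from the current policy.
        pi = softmax(theta[s])
        a = sample(pi, rng)
        # Perform a_t; the environment returns s_{t+1} and r_t.
        s_next, r = toy_env_step(s, a)
        # Sample a next action for the TD target only (it is not performed).
        a_next = sample(softmax(theta[s_next]), rng)
        # Evaluate the critic and compute the TD error.
        q_t = q[s][a]
        delta = q_t - (r + gamma * q[s_next][a_next])
        # Critic: gradient descent on the squared TD error.
        q[s][a] -= alpha * delta
        # Actor: gradient ascent, weighted by the critic's score q_t.
        for b in range(len(theta[s])):
            grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
            theta[s][b] += beta * q_t * grad_log_pi
        s = s_next
    return theta, q

theta, q = train()
pi = softmax(theta[0])
```

After training, `pi[1]` should exceed `pi[0]`, since the critic scores action 1 higher and the actor follows the critic. Some variants weight the actor update by the TD error instead of `q_t` to reduce variance; that choice is not shown here.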
Summary
Policy Network and Value Network
![](https://img-blog.csdnimg.cn/20200809152635778.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
![](https://img-blog.csdnimg.cn/20200809152655263.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)
Training
![](https://img-blog.csdnimg.cn/20200809152741934.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzM4MTU2MTA0,size_16,color_FFFFFF,t_70)