• One extra network that is fixed, randomly initialized, and never trained
• RND performs well in hard-exploration environments (where rewards are very sparse)

Methods

1. Counting

1. In the tabular setting, states and actions are countable, so a table can track the visit count of every state-action pair, and the intrinsic reward can be defined by a formula that decays with the visit count, e.g. $r^{i}_{t} = \beta / \sqrt{N(s_t, a_t)}$.

2. In the non-tabular setting, states and actions are not fully countable, so pseudo-counts can be used instead: novelty is measured by how much a learned state-density estimate changes after observing a state. Reference: "Unifying Count-Based Exploration and Intrinsic Motivation".

2. Prediction: the intrinsic reward is designed from the error of a model that predicts the environment dynamics, either forward (s, a → s') or inverse (s, s' → a).
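As a minimal illustration of the tabular counting case above, a count-based bonus can be sketched as follows (class and parameter names are my own; `beta` is an assumed bonus coefficient):

```python
import math
from collections import defaultdict

class CountBonus:
    """Tabular count-based exploration bonus (illustrative sketch)."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s, a), zero by default

    def intrinsic_reward(self, state, action):
        # Increment the visit count, then return a bonus that
        # decays as 1 / sqrt(N(s, a)).
        self.counts[(state, action)] += 1
        return self.beta / math.sqrt(self.counts[(state, action)])
```

Frequently visited state-action pairs quickly receive a small bonus, while novel pairs keep a large one.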

This paper introduces a different approach where the prediction problem is randomly generated. This involves two neural networks: a fixed and randomly initialized target network which sets the prediction problem, and a predictor network trained on data collected by the agent.

RND introduces two networks:

1. The predictor network, part of the RL agent's networks, is trainable and predicts a latent feature of the observation, $\hat{f}: \mathcal{O} \rightarrow \mathbb{R}^{k}$;
2. The target network, an extra network, is not trained: its parameters are fixed after random initialization, and its output latent features serve as the prediction targets, $f: \mathcal{O} \rightarrow \mathbb{R}^{k}$.
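A minimal sketch of this two-network setup, assuming linear networks for simplicity (the paper uses CNNs on Atari observations; all names here are my own):

```python
import numpy as np

# Illustrative RND sketch with linear networks (an assumption;
# the actual paper uses convolutional networks).
rng = np.random.default_rng(0)
obs_dim, k = 8, 4

W_target = rng.normal(size=(obs_dim, k))  # fixed, never trained
W_pred = rng.normal(size=(obs_dim, k))    # trained on agent's data

def intrinsic_reward(obs):
    # The exploration bonus is the prediction error
    # ||f_hat(o) - f(o)||^2 between predictor and target features.
    err = obs @ W_pred - obs @ W_target
    return float(np.sum(err ** 2))

def train_step(obs, lr=1e-2):
    # One SGD step on the predictor's squared error.
    global W_pred
    err = obs @ W_pred - obs @ W_target
    grad = 2.0 * np.outer(obs, err)  # d/dW_pred of ||err||^2
    W_pred -= lr * grad
```

Repeatedly training on a frequently visited observation drives its bonus toward zero, while novel observations keep a high bonus.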

The paper identifies four sources of prediction error:

1. Amount of training data. Prediction error is high where the model has seen few similar examples.
2. Stochastic environment dynamics. Random transitions are an irreducible source of forward-prediction error.
3. Model misspecification. Prediction error is high when necessary information is missing (I am not sure what key information this refers to) or when the model class is too limited to fit the complexity of the target function.
4. Learning dynamics. The optimization process may fail to converge to a stable fit of the target function.

Note that even where one is not trying to combine episodic and non-episodic reward streams, or reward streams with different discount factors, there may still be a benefit to having separate value functions since there is an additional supervisory signal to the value function. This may be especially important for exploration bonuses since the extrinsic reward function is stationary whereas the intrinsic reward function is non-stationary.

Observation normalization

1. Before training starts, interact with the environment using a random policy to initialize the running mean and standard deviation.
2. Once training starts, subtract the running mean from each observation,
3. then divide by the running standard deviation,
4. and clip the result to [-5, 5].
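The steps above can be sketched with a running-statistics helper (a simplified sketch using a Welford-style update; class and parameter names are my own):

```python
import numpy as np

class RunningNorm:
    """Running mean/std normalizer with clipping (illustrative sketch)."""

    def __init__(self, shape, clip=5.0):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.clip = clip

    def update(self, x):
        # Welford-style incremental update of mean and population variance.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        # Subtract the running mean, divide by the running std,
        # then clip to [-clip, clip].
        std = np.sqrt(self.var) + 1e-8
        return np.clip((x - self.mean) / std, -self.clip, self.clip)
```

In practice the statistics would be initialized from random-policy rollouts before training, as described above.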

Observation normalization is often important in deep learning but it is crucial when using a random neural network as a target, since the parameters are frozen and hence cannot adjust to the scale of different datasets.

Lack of normalization can result in the variance of the embedding being extremely low and carrying little information about the inputs.

Evaluation

We find that the RND exploration bonus is sufficient to deal with local exploration, i.e. exploring the consequences of short-term decisions, like whether to interact with a particular object, or avoid it. However global exploration that involves coordinated decisions over long time horizons is beyond the reach of our method.
