# Model-Based or Model-Free

Model-Based:

• The agent either already knows the model ($\mathcal{S,A,R,P}$ finite and known) or first learns one (fitting the state-transition and reward functions with supervised learning), and then uses planning methods (which anticipate every possible state transition) within that model to compute a solution

Now if we know what all those elements of an MDP are, we can just compute the solution before ever actually executing an action in the environment. In AI, we typically call computing the solution to a decision-making problem before executing an actual decision planning. Some classic planning algorithms for MDPs include Value Iteration, Policy Iteration, and a whole lot more.
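Since the model is fully known here, planning needs no environment interaction at all. A minimal value-iteration sketch on a made-up two-state, two-action MDP (all numbers below are illustrative, not from any real task):

```python
# Minimal value-iteration sketch on a toy, fully known MDP.
# The 2-state, 2-action MDP below is invented for illustration.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a, s'] = transition probability, R[s, a] = expected reward
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):              # sweep until values converge
    Q = R + gamma * P @ V          # Q[s,a] = R[s,a] + γ Σ_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)          # greedy Bellman backup
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
print(V, policy)
```

The whole solution is computed before a single action is ever executed, which is exactly what distinguishes planning from learning.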

Model-Free:

• The agent learns by trial and error in the environment ($\mathcal{S,A,R,P}$ may be known but not solved by planning, or may be unknown), and uses learning methods (which do not anticipate every possibility) to produce the best policy

But the RL problem isn’t so kind to us. What makes a problem an RL problem, rather than a planning problem, is the agent does not know all the elements of the MDP, precluding it from being able to plan a solution. Specifically, the agent does not know how the world will change in response to its actions (the transition function $T$), nor what immediate reward it will receive for doing so (the reward function $R$). The agent will simply have to try taking actions in the environment, observe what happens, and somehow, find a good policy from doing so.

Model-Based: DP, Policy Iteration, Value Iteration, …

Model-Free: SARSA, Q-Learning, PG, …

If you want a way to check whether an RL algorithm is model-based or model-free, ask yourself this question: after learning, can the agent make predictions about what the next state and reward will be before it takes each action? If it can, then it’s a model-based RL algorithm. If it cannot, it’s a model-free algorithm.
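That check can be made concrete: a model-based agent maintains an estimate of the transition and reward functions that it can query before acting, while a model-free agent keeps only value estimates. A toy tabular sketch (the transitions fed in below are invented purely for illustration):

```python
# Illustrative sketch of the "can it predict?" test.
# A model-based agent fits counts of (s, a) -> s' and mean rewards,
# so it can answer "what happens if I take a here?" before acting.
from collections import defaultdict

class TabularModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s,a) -> {s': n}
        self.reward_sum = defaultdict(float)                 # (s,a) -> sum of r
        self.visits = defaultdict(int)                       # (s,a) -> n

    def update(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a)] += r
        self.visits[(s, a)] += 1

    def predict(self, s, a):
        """Most likely next state and mean reward -- only a model-based
        agent can answer this before actually taking the action."""
        s_next = max(self.counts[(s, a)], key=self.counts[(s, a)].get)
        return s_next, self.reward_sum[(s, a)] / self.visits[(s, a)]

model = TabularModel()
for s, a, r, s_next in [(0, 1, 1.0, 2), (0, 1, 1.0, 2), (0, 1, 0.0, 3)]:
    model.update(s, a, r, s_next)

print(model.predict(0, 1))  # prediction made without acting
```

A pure Q-table, by contrast, stores only $Q(s,a)$ values and has no way to answer what $s'$ or $r$ will be, which is what makes Q-Learning model-free.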

What is the difference between model-based and model-free reinforcement learning?

OpenAI Spinning Up: A Taxonomy of RL Algorithms

• PPO
• SAC
• …

• SARSA
• Q-Learning
• DQN
• …

# On-policy or Off-policy

On-line Learning (three common senses):

1. Single-sample learning; each sample is discarded after use; samples arrive continuously as a data stream, not a dataset
2. Single-sample updates (SGD)
3. Single-sample or mini-batch learning; samples arrive continuously as a data stream, not a dataset

Off-line Learning (the corresponding senses):

1. Batch or full-sample learning, repeated over multiple passes; a static sample set
2. Mini-batch learning
3. Full-sample learning; a static dataset

1. Under the first sense, reinforcement learning is neither On-line Learning nor Off-line Learning. It is not Off-line Learning because its samples are not static or fixed; it is not On-line Learning because, while algorithms such as Q-Learning, Sarsa, PG, and PPO discard each sample after use, algorithms such as DQN and TD3 reuse samples.
2. Under the second sense, reinforcement learning includes both On-line Learning and Off-line Learning.
3. Under the third sense, reinforcement learning belongs to On-line Learning.

Is there any relationship between On-policy/Off-policy and On-line/Off-line Learning?

Behavior policy:

• The policy used to sample the trajectory $S_{0},A_{0},R_{1},S_{1},A_{1},R_{2},\dots,S_{n},A_{n},R_{n+1}$
• Formally: the policy that guides the agent’s actual interaction with the environment
• Not necessarily represented by a model

Target policy:

• The policy being optimized
• Formally: the policy used to evaluate state or action values, or the policy to be optimized, is called the target policy

On-policy:

• In short: sample and learn with the same policy
• Formally: when the policy the agent optimizes during learning is the same as its behavior policy, this learning mode is called on-policy learning
• The behavior policy and the target policy are the same policy

Off-policy:

• In short: one policy samples, another learns (“you sample, I learn”)
• Formally: when the policy the agent optimizes during learning is different from its behavior policy, this learning mode is called off-policy learning
• The behavior policy and the target policy differ; the behavior policy may be a “copy” of the target policy (as in a dual-network structure), or an entirely separate sampling policy

|              | SARSA | Q-learning |
|--------------|-------|------------|
| Choosing A′  | π     | π          |
| Updating Q   | π     | μ          |
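The distinction in that table shows up directly in the one-step update rules: SARSA bootstraps from the action A′ actually chosen by the behavior policy π, while Q-learning bootstraps from the greedy policy μ. A minimal tabular sketch (ε, α, γ and the demo transition below are illustrative constants, not from any specific task):

```python
# One-step SARSA (on-policy) vs Q-learning (off-policy) updates.
import random
from collections import defaultdict

def eps_greedy(Q, s, n_actions, eps=0.1):
    """Behavior policy π: mostly greedy, sometimes random."""
    if random.random() < eps:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.5, gamma=0.9):
    # Target uses A' that the behavior policy π actually selected.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, n_actions, alpha=0.5, gamma=0.9):
    # Target uses the greedy policy μ, regardless of what π does next.
    best = max(Q[(s2, a)] for a in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])

Q = defaultdict(float)
# One illustrative transition (s=0, a=0, r=1.0, s'=1):
a2 = eps_greedy(Q, 1, n_actions=2)          # π also picks A' for SARSA
sarsa_update(Q, 0, 0, 1.0, 1, a2)
q_learning_update(Q, 0, 1, 1.0, 1, n_actions=2)
```

Because Q-learning’s target does not depend on the action π takes next, it can learn from samples generated by any behavior policy, which is exactly what “off-policy” means.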

# Stochastic or Deterministic
