深度解析强化学习理论 | PodLM - AI Podcast Generator

Sources

作者：朱哲清Bill链接：https://www.zhihu.com/question/312164724/answer/2444672861来源：知乎著作权归作者所有。商业转载请联系作者获得授权，非商业转载请注明出处。想着把我所知道的RL理论稍微整理一下，能够为大家梳理一下思路。题主提的第二篇文章作者Ian Osband和我是同一个PhD advisor，所以对他之前做的research要更熟悉一些。我想从一个更系统的方式来讲一下目前RL算法的理论结果。总体来说，RL理论应该分成两大块，一个是假设transition kernel，也就是world model已知的情况下的performance bound和基于exploration的performance bound。总体来说后者的重要性高于前者，对于应用有更深远的影响，毕竟大多数RL解决的问题是不可能提供一个transition kernel的。在以下的内容中，我会同时列举一些我个人认为比较重要，影响比较深远的理论paper。同时在以下内容中，我假设读者对于RL的常规算法是已经有了比较深的理解了，所以就不在算法一一是什么上面赘述了。已知Transition Kernel / World Model -> RL算法的convergence先来讲第一部分，已知transition kernel情况下的RL bound。这边要研究的其实是RL算法本身的convergence和learning能力。这个对于实际应用中RL算法的能力借鉴意义不大。但是对于后续的理论研究（下文第二部分）的意义非常大。其实最早的时候，已知transition kernel的情况下，这种AI方法叫做planning，属于Dynamic Programming分支下的一些方法。其中最常用的两种算法是Value Iteration和Policy Iteration。Value Iteration: 最早由Richard Bellman，也就是著名的Bellman Equation的提出者（他其实最重要的贡献是作为Dynamic Programming的提出者），在1957年MDP的重要文章中提出的。由于Value Iteration是基于一个给定MDP，并且Discrete State Space的，所以RL的regret bound在这没有太大意义。更重要的是要理解这边的convergence result，根据Contraction Mapping的理论，Value Iteration是一定会converge的。Convergence rate和 γ\gamma\gamma (discount factor)相关。 γ\gamma\gamma 越小convergence越快。Bellman, Richard. "A Markovian decision process."Journal of mathematics and mechanics(1957): 679-684.Policy Iteration: 最早我所知道的Policy Iteration Algorithm由Ronald Howard在1960年提出，同样的，它的目的也是要convergence而不是regret bound。Policy Iteration我所知道的general理论结果只有finite step guarantee，但是在实际应用中即便是很复杂的环境，它的convergence速度一般是在几十个iteration左右。非常快。这边分享一个Ronald的趣闻，他其实主要研究方向不是DP，而是Decision Analysis。然后他在我们学校上课的时候会让我们在考试卷中的选择题对每个选项上标记自己所认为的概率，比如A是10%，B是10%，C是20%，D是60%。然后如果你这题错了，就会扣掉你所选的选项的ln(probability)。如果你碰巧在这个选项上放了0概率，那你这门课就不及格了。大家可以想想这个举措的目的哈哈，很有深意。Howard, Ronald A. "Dynamic programming and markov processes." (1960).在Value Iteration和Policy Iteration之后呢出现了一大批的Approximate Dynamic Programming的理论结果。首先是Function Approximation (state projection/abstraction) 在Value Iteration上的延伸。其中最重要的结果就是Fitted Value-Iteration，意思就是说，与其用per state action pair的value，可以用function approximator来代替这些value function。 Error bound 如下， π\pi\pi 是greedy to V：首先是Function Approximation （state projection/abstraction）在Value Iteration上的延伸。其中最重要的结果就是Fitted value-Iteration，意思就是说，与其用per state action pair的value，可以用function approximator来代替这些value function。Error bound 如下， π \pi 是贪婪到 V： ||V∗−Vπ||∞≤2(1−γ)2infV∈F||V∗−V||∞||V^* - V^\pi||_\infty \leq \frac{2}{(1 - \gamma)^2}\inf_{V\in \mathcal{F}}||V^* - V||_\infty||V^* - V^\pi||_\infty \leq \frac{2}{(1 - \gamma)^2}\inf_{V\in \mathcal{F}}||V^* - V||_\infty 具体可以看下文。Bertsekas, Dimitri P., and John N. Tsitsiklis.Neuro-dynamic programming. Athena Scientific, 1996.然后是Approximate Policy Iteration。这个也是上文中提出来的，对于上文function approximator之中得到的value function做基础在每个iteration中做greedy policy。Error bound如下，π\pi\pi 是greedy to VkV_kV_k ： limk→∞sup||V∗−Vπk||∞≤2γ(1−γ)2limk→∞sup||Vk−Vπk||∞\lim_{k\rightarrow \infty}\sup||V^* - V^{\pi_k}||_\infty \leq \frac{2\gamma}{(1 - \gamma)^2}\lim_{k\rightarrow \infty}\sup||V_k - V^{\pi_k}||_\infty\lim_{k\rightarrow \infty}\sup||V^* - V^{\pi_k}||_\infty \leq \frac{2\gamma}{(1 - \gamma)^2}\lim_{k\rightarrow \infty}\sup||V_k - V^{\pi_k}||_\infty 具体也可以看下文。Bertsekas, Dimitri P., and John N. Tsitsiklis.Neuro-dynamic programming. Athena Scientific, 1996.在这之后呢有了linear approximator的一些很重要的结果，我PhD advisor的这篇文章中的结果应该是最早的TD( λ\lambda\lambda )在linear approximator中的结果。给定一个parameter α\alpha\alpha 的linear space, Error bound是 ||Vα∗−Vπ||μ≤1−λγ1−γinfα||Vα−Vπ||μ||V_{\alpha^} - V^{\pi}||_\mu \leq \frac{1 - \lambda\gamma}{1 - \gamma}\inf_\alpha||V_\alpha - V^{\pi}||_\mu||V_{\alpha^} - V^{\pi}||_\mu \leq \frac{1 - \lambda\gamma}{1 - \gamma}\inf_\alpha||V_\alpha - V^{\pi}||_\mu 这边 μ\mu\mu 是environment true probability distribution，然后norm是probability space weighted norm。Tsitsiklis, John, and Benjamin Van Roy. "Analysis of temporal-diffference learning with function approximation."Advances in neural information processing systems 9 (1996)Tsitsiklis、John 和 Benjamin Van Roy。“使用函数近似分析时间差异学习。”神经信息处理系统进展 9 （1996）最后呢说个我本科Advisor Ron Parr的理论结果。在一定条件下当environment true probability distribution是policy induced kernel的stationary distribution的情况下，我们可以得到一个在linear approximation下的uniqueness的结果，我们通常把这个结果叫做LSTD (Least Squared Temporal Difference)。Error bound是 ||Vπ−VTD||μ≤11−γ2infV∈F||Vπ−V||μ||V^\pi - V_{TD}||_\mu \leq \frac{1}{\sqrt{1 - \gamma^2}}\inf_{V \in \mathcal{F}}||V^{\pi} - V||_\mu||V^\pi - V_{TD}||_\mu \leq \frac{1}{\sqrt{1 - \gamma^2}}\inf_{V \in \mathcal{F}}||V^{\pi} - V||_\mu Lagoudakis, Michail G., and Ronald Parr. "Least-squares policy iteration."The Journal of Machine Learning Research4 (2003): 1107-1149.Lagoudakis、Michail G. 和 Ronald Parr。“最小二乘策略迭代。”机器学习研究杂志4 （2003）：1107-1149。以上内容借鉴了Semi Munos的RL Theory的一些slides里的内容。在说完了传统的planning算法后，我也快速提一下Policy Gradient和Actor-Critic的一些理论结果。在Sutton的经典入门书籍中，他其实做了一个policy gradient的proof，就是说，policy gradient的direction in expectation和增加cumulative reward的direction是一致的。并且也有不少的在一定假设下的convergence guarantee和convergence rate的analysis。Sutton, Richard S., and Andrew G. Barto.Reinforcement learning: An introduction. MIT press, 2018.Sutton， Richard S.， and Andrew G. Barto.强化学习：简介。麻省理工学院出版社，2018 年。Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans Proceedings of the 37th International Conference on Machine Learning, PMLR 119:6820-6829, 2020.Jincheng Mei， Chenjun Xiao， Csaba Szepesvari， Dale Schuurmans 第 37 届机器学习国际会议论文集， PMLR 119：6820-6829， 2020.对于Actor-critic来说，其实它的convergence guarantee很久以前就被研究过了，这边policy gradient部分比较好证一些因为它的critic function有一些parameter和model上的assumption。Konda, Vijay, and John Tsitsiklis. "Actor-critic algorithms."Advances in neural information processing systems12 (1999).Konda, Vijay R., and John N. Tsitsiklis. "Onactor-critic algorithms."SIAM journal on Control and Optimization42.4 (2003): 1143-1166.未知Transition Kernel / World Model -> RL算法收集数据的能力在说了那么多以上的结果后，我们来讲一讲脱离已知环境后RL算法的能力。这边我们很大程度上我们就开始讨论exploration的作用了，也是过去10年RL理论研究的重头戏。我会从Bandit开始说起说到RL exploration的结果。值得注意的是，这些结论大多数是基于这些paper提出的exploration的算法的。所以很多结果不一定能够直接在后续用得上。在以下内容我就不说epsilon-greedy和Boltzmann的情况了，因为现在基本不会有人研究这种exploration的regret bound了。我们在这边一般要研究的是算法的regret bound和sample complexity。简而言之就是需要多少数据才能达到最优解或者次优解。首先我们先说Bandit Learning。市面上的证明很多，Thompson Sampling这方向我主要会用我advisor这条线上的一些结果。基本都是state-of-the-art的证明结果了，但我个人认为这里的很多证明技巧要更elegant一些。UCB我会用市面上最well-known的一些paper。Non-parametric Bandit with Thompson Sampling的理论结果regret bound是 O(NTlogT)O(\sqrt{NTlogT})O(\sqrt{NTlogT}) 。一般来说理解这种bound可以用average regret来理解，也就是除以T，得到这个是sublinear的。Agrawal, Shipra, and Navin Goyal. "Further optimal regret bounds for thompson sampling." Artificial intelligence and statistics. PMLR, 2013.Agrawal、Hipra 和 Navin Goyal。“汤普森采样的进一步最佳遗憾边界。”人工智能和统计学。PMLR，2013 年。Non-parametric Bandit with UCB的理论结果regret bound是（对于UCB1）是一个非常非常复杂的bound，但是总的bound和Thompson Sampling在同一个数量级。具体可见Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem."Machine learning47.2 (2002): 235-256.Auer、Peter、Nicolo Cesa-Bianchi 和 Paul Fischer。“多臂老虎机问题的有限时间分析。”机器学习47.2 （2002）：235-256。Logistic Bandit with Thompson Sampling的最新理论结果是用information theory证的，我觉得非常elegant。regret bound是 O(dTlogT)O(d\sqrt{T}logT)O(d\sqrt{T}logT) .也是sublinear。Dong, Shi, and Benjamin Van Roy. "An information-theoretic analysis for thompson sampling with many actions."Advances in Neural Information Processing Systems31 (2018).Dong、Shi 和 Benjamin Van Roy。“具有许多动作的汤普森采样的信息论分析。”神经信息处理系统的进展31 （2018）。Logistic Bandit with UCB: Lihong的组对于LinUCB做了一个PAC bound，这边得到的regret bound with Probability 1- δ\delta\delta : O(Tdlog3(NTlogT/δ)O(\sqrt{Tdlog^3(NTlogT/\delta})O(\sqrt{Tdlog^3(NTlogT/\delta}) .Chu, Wei, et al. "Contextual bandits with linear payoff functions."Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011.Chu， Wei， et al. “具有线性支付函数的上下文老虎机。”第十四届人工智能与统计国际会议论文集。JMLR 研讨会和会议论文集，2011 年。我在这边再加一个新的exploration算法，叫做Information Directed Sampling (IDS)。这个方法的直接目的就是在expected regret和information gain（得到的新的信息）之间做权衡。Information Directed Sampling: 在linear bandit上，IDS的regret bound可以达到 O(log(|A|)dT)O(\sqrt{log(|\mathcal{A}|)dT})O(\sqrt{log(|\mathcal{A}|)dT}) .Russo, Daniel, and Benjamin Van Roy. "Learning to optimize via information-directed sampling."Advances in Neural Information Processing Systems27 (2014).Russo、Daniel 和 Benjamin Van Roy。“通过信息导向抽样学习优化。”神经信息处理系统进展27 （2014）。我们接下来讲上述的这些exploration算法推广到RL上：首先我们看UCB exploration下的RL算法，regret bound大约是： O(H3SAT)O(\sqrt{H^3SAT})O(\sqrt{H^3SAT}) . 这篇paper确实有很重要的意义，因为UCB的regret bound proof一般不好做。Jin, Chi, et al. "Is Q-learning provably efficient?."Advances in neural information processing systems31 (2018).Jin， Chi， et al. “Q-learning 是否可证明有效？.”神经信息处理系统的进展31 （2018）。我们接下来看Thompson Sampling下的RL算法，regret bound大约是： O(HSAT)O(HS\sqrt{AT})O(HS\sqrt{AT}) 。这篇paper相对比较有代表性，但是Thompson Sampling有很多版本，所以这只是其中一个结果。Ouyang, Yi, et al. "Learning unknown markov decision processes: A thompson sampling approach."Advances in neural information processing systems30 (2017).Ouyang， Yi， et al. “学习未知马尔可夫决策过程：一种汤普森抽样方法。”神经信息处理系统的进展30 （2017）。Thompson Sampling和UCB在RL setting下有个大问题，就是resampling。也就是说step和step之间的belief不一致，导致exploration效果不好。所以这边引出了Deep Exploration。如果可以保证一个episode之间的value function belief的一致性，那regret bound可以达到大约： O(HSAHL)O(H\sqrt{SAHL})O(H\sqrt{SAHL}) 。这边L指的是episode的数量。可以想作HL=T。这个算法的想法引申自RLSVI。Osband, Ian, et al. "Deep Exploration via Randomized Value Functions."J. Mach. Learn. Res.20.124 (2019): 1-62.Van Roy, Benjamin, and Zheng Wen. "Generalization and exploration via randomized value functions."stat1050 (2014): 4.最后讲一下IDS在RL上的结果，实验上来说，这个确实已经outperform了以上所有的结果了，但是asymptotic notation的结果还是相对比较dependent on information structure。所以有待更多的研究。Lu, Xiuyuan, et al. "Reinforcement learning, bit by bit." arXiv preprint arXiv:2103.04047 (2021).总体来说希望以上这些能够帮大家理解目前RL主流理论的目前结果吧。整理了很久，希望大家多多支持。

Podcast Editor

Podcast.json

Preview

Audio

Title

Description

Topics