Nagi-ovo

Breezing

RL


A "Speedrun" Through PPO

Proximal Policy Optimization — at last we arrive at one of the RL algorithms that has been hot in NLP in recent years. In on-policy algorithms, the policy used to collect data is the same as the policy being trained. The problem is that data must be discarded after a single use and then re-collected, which makes training very slow. The intuition behind PPO …
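The data-reuse problem mentioned in the excerpt is what PPO's clipped surrogate objective addresses: clipping the probability ratio keeps updates close to the data-collecting policy, so each batch can be reused for several gradient steps. As a rough illustration (my own sketch, not code from the post):

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    # Probability ratio between the current policy and the
    # (frozen) policy that collected the data.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio into [1 - eps, 1 + eps] so the update
    # cannot push the policy too far from the old one.
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    # Pessimistic bound: take the smaller of the two terms.
    return min(ratio * advantage, clipped * advantage)

# With a positive advantage, a large ratio gets capped at 1 + eps.
print(ppo_clip_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))
```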

A First Look at Actor-Critic Methods

The variance problem. Policy Gradient methods have drawn attention for their intuitiveness and effectiveness. We previously explored the REINFORCE algorithm, which performs well on many tasks. However, REINFORCE relies on Monte Carlo sampling to estimate returns, which means we need an entire episode's data to compute them …
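To make the Monte Carlo dependence concrete, here is a minimal sketch (my own illustration, not from the post) of computing discounted returns — note that nothing can be computed until the whole episode's rewards are in hand:

```python
def discounted_returns(rewards, gamma=0.99):
    # Monte Carlo estimation: iterate backwards over the complete
    # episode, accumulating G_t = r_t + gamma * G_{t+1}.
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))
```

Because every return sums many random rewards, these estimates have high variance — the motivation for introducing a learned critic.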

From DQN to Policy Gradient

Review: Q-Learning is an algorithm for training a Q function, the action-value function that determines the value of taking a particular action in a particular state. It works by maintaining a Q-table that stores the value of every state-action pair. For an Atari game like Space Invaders …
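The tabular update the excerpt alludes to can be sketched in a few lines (a hypothetical example of my own; the states and actions below are made up):

```python
def q_update(q_table, state, action, reward, next_state,
             alpha=0.1, gamma=0.9):
    # Temporal-difference update toward the greedy bootstrap target:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next
    q_table[state][action] += alpha * (target - q_table[state][action])

# Toy Q-table with two states and two actions.
q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 0.0, "right": 1.0}}
q_update(q, "s0", "right", reward=1.0, next_state="s1")
print(q["s0"]["right"])
```

For Atari-scale state spaces a table like this is infeasible, which is what motivates replacing it with a neural network (DQN).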

Reinforcement Learning Basics and Q-Learning

This year I used the DeepSeek-Math-7B-RL model in a Kaggle competition, and treated Claude 3.5 Sonnet as my teacher while studying; both models owe much of their strength to RL. I had a vague sense that the techniques in this field are powerful and beautiful, so I decided to dip into it, but my fundamentals weren't solid and I couldn't follow OpenAI Spinning …

An Introduction to Policy Gradient

This post records my study of Andrej Karpathy's Deep RL Bootcamp and his blog post, Deep Reinforcement Learning: Pong from Pixels. Progress in RL has not mainly been driven by novel, astonishing ideas: the 2012 …