Google Research Football - RL Environment


 2019/06/12   Share

Source on GitHub: google football research github page

Abstract

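The paper introduces the Google Research Football Environment, a physics-based football environment that is easy to port and is released under an open-source license. It also proposes three full-game scenarios of varying difficulty, the Football Benchmarks, to measure model performance, and evaluates three commonly used reinforcement learning algorithms on them (IMPALA, PPO, Ape-X DQN). Finally, it proposes a set of somewhat simpler scenarios, the Football Academy.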

Introduction

The paper's main contributions are:

provide the Football Engine, a highly-optimized game engine that simulates the game of football,

propose the Football Benchmarks, a versatile set of benchmark tasks of varying difficulties that can be used to compare different algorithms,

propose the Football Academy, a set of progressively harder and diverse reinforcement learning scenarios,

evaluate state-of-the-art algorithms on both the Football Benchmarks and the Football Academy, providing an extensive set of reference results for future comparison, and

provide a simple API to completely customize and define new football reinforcement learning scenarios.

Football Engine

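The Football Environment in this paper builds on an earlier project, the GameplayFootball simulator. The engine simulates a complete football match and accepts input actions from both teams. It implements many aspects of the game, including kickoffs, goals, fouls, corner kicks, penalty kicks, and throw-ins.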

Supported Football Rules

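Nearly all football rules are supported, even substitutions. Game length is measured in frames: a full game is 3000 frames by default, but this can be customized, as can the initial number of players and their positions. Players on each side tire as the game goes on, and each side can make at most 3 substitutions.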

Built-in AI Opponent

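The built-in opponent is a rule-based AI developed as part of the GameplayFootball simulator. A difficulty parameter $\theta$ controls how hard the opponent is by adjusting its decision reaction time. The recommended values of $\theta$ for the three difficulty levels easy, medium, and hard are 0.05, 0.6, and 0.95 respectively. The built-in AI opponent can also be replaced with our own algorithm. The paper then addresses the question we care about most: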

Moreover, by default, our non-active players are also controlled by another rule-based bot. In this case, the behavior is simple and corresponds to reasonable football actions and strategies, such as running towards the ball when we are not in possession, or move forward together with our active player. In particular, this type of behavior can be turned off for future research on cooperative multi-agents if desired.

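In other words, only the currently active player is controlled by our agent; for now, the remaining players are controlled by a rule-based bot. This behavior can be turned off, which makes the environment usable for multi-agent research in the future.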

State & Observation

The paper defines the state as the complete set of data returned by the environment after actions are performed. It includes the ball position and possession, the coordinates of all players, the active player, the game state (player tiredness, yellow cards, score, and so on), and the current pixel frame.

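The paper defines an observation as the result of applying any transformation to the state; the observation is what is passed as input to the control algorithm. It proposes three representations: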

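Pixels: a 1280$\times$720 RGB image.
Super Mini Map (SMM): four 96$\times$72 matrices encoding the home team, the away team, the ball, and the active player. Each matrix is binary, in effect a bitmap that marks whether the corresponding object is present at each position.
Floats: a more compact representation, a 115-dimensional vector that summarizes the match, including player coordinates, ball possession and direction, the active player, and the game mode.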
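To make the SMM idea concrete, here is a minimal sketch of how four binary planes could be built from object coordinates. The function names, the plane order, and the coordinate ranges (x in [-1, 1], y in [-0.42, 0.42]) are assumptions for illustration only, not the Engine's actual implementation.

```python
# Sketch of an SMM-style observation: four binary 72-row x 96-column planes.
# Coordinate ranges and plane order are assumptions, not taken from the paper.
WIDTH, HEIGHT = 96, 72
X_RANGE = (-1.0, 1.0)
Y_RANGE = (-0.42, 0.42)


def to_cell(x, y):
    """Map a pitch coordinate to a (row, col) cell, clamped to the grid."""
    col = int((x - X_RANGE[0]) / (X_RANGE[1] - X_RANGE[0]) * WIDTH)
    row = int((y - Y_RANGE[0]) / (Y_RANGE[1] - Y_RANGE[0]) * HEIGHT)
    return min(max(row, 0), HEIGHT - 1), min(max(col, 0), WIDTH - 1)


def make_plane(points):
    """Return a binary plane with a 1 in every cell that contains a point."""
    plane = [[0] * WIDTH for _ in range(HEIGHT)]
    for x, y in points:
        row, col = to_cell(x, y)
        plane[row][col] = 1
    return plane


def build_smm(home_players, away_players, ball, active_player):
    """Stack the four planes: home team, away team, ball, active player."""
    return [
        make_plane(home_players),
        make_plane(away_players),
        make_plane([ball]),
        make_plane([active_player]),
    ]


smm = build_smm(
    home_players=[(0.0, 0.0), (-0.5, 0.1)],
    away_players=[(0.5, -0.1)],
    ball=(0.0, 0.0),
    active_player=(0.0, 0.0),
)
```

Compared with the pixel representation, each plane here carries only occupancy information, which is why it is so much smaller than a 1280$\times$720 RGB frame.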

Actions & Accessibility

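The action space consists of 16 discrete actions: eight movement actions (one per direction), three types of pass (Short, High, Long), one shooting action (Shot), Sprint (which affects the player's tiredness), Stop-Moving, Stop-Sprint, and Do-Nothing.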
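As a quick check of the count, the 16 actions can be written out as an enumeration. The member names and their numeric values below are hypothetical and chosen for readability; they are not the Engine's own identifiers or action indices.

```python
from enum import Enum, auto


class Action(Enum):
    """The 16 discrete actions described above (names are illustrative)."""
    # Eight movement directions
    MOVE_LEFT = auto()
    MOVE_TOP_LEFT = auto()
    MOVE_TOP = auto()
    MOVE_TOP_RIGHT = auto()
    MOVE_RIGHT = auto()
    MOVE_BOTTOM_RIGHT = auto()
    MOVE_BOTTOM = auto()
    MOVE_BOTTOM_LEFT = auto()
    # Three types of pass
    SHORT_PASS = auto()
    HIGH_PASS = auto()
    LONG_PASS = auto()
    # Shooting
    SHOT = auto()
    # Sprint control and idle actions
    SPRINT = auto()
    STOP_MOVING = auto()
    STOP_SPRINT = auto()
    DO_NOTHING = auto()


MOVEMENT = [a for a in Action if a.name.startswith("MOVE_")]
PASSES = [a for a in Action if a.name.endswith("_PASS")]
print(len(Action), len(MOVEMENT), len(PASSES))
```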

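The environment can be used for direct player-versus-player matches as well as for pitting algorithms against each other. The game can be played with a keyboard or a gamepad. In addition, replays at several rendering qualities are saved automatically during training, so researchers can inspect them.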

Stochasticity

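The game has two modes, stochastic and deterministic. In stochastic mode, the same action taken in the same state may lead to different outcomes; in deterministic mode, the same policy starting from the same state always produces the same result.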

API & Performance

The Engine is compatible with the OpenAI Gym API, that is, the familiar RL interface of reset() and obs, reward, done, info = step(action). I may write a short note on it when I have time.
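The reset()/step() loop can be sketched with a stand-in environment. ToyEnv below is a hypothetical placeholder that only mimics the Gym-style interface; it is not the Football Engine, and its observations and rewards are made up for the example.

```python
class ToyEnv:
    """Stand-in environment exposing a Gym-style reset()/step() interface."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.t = 0
        return [0.0]

    def step(self, action):
        """Advance one step and return (obs, reward, done, info)."""
        self.t += 1
        obs = [float(self.t)]
        reward = 1.0 if action == 1 else 0.0  # made-up reward rule
        done = self.t >= self.episode_length
        return obs, reward, done, {}


def run_episode(env, policy):
    """Run one episode with the given policy and return the total reward."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
    return total_reward


print(run_episode(ToyEnv(episode_length=5), policy=lambda obs: 1))
```

Because run_episode only relies on reset() and step(), the same loop should work with any environment that follows this interface.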

The Engine is written in heavily optimized C++ and can use a GPU for rendering. In the experiments it runs about 25M steps per day on a single six-core machine (Intel Xeon E5-1650 v2 CPU, 3.5GHz).
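For a sense of scale, the throughput figure can be converted into steps per second and full games per day. This is back-of-the-envelope arithmetic that assumes one environment step corresponds to one frame of the default 3000-frame game.

```python
STEPS_PER_DAY = 25_000_000       # throughput reported above
SECONDS_PER_DAY = 24 * 60 * 60
FRAMES_PER_GAME = 3000           # default full-game length

# Assumption: one environment step equals one game frame.
steps_per_second = STEPS_PER_DAY / SECONDS_PER_DAY
games_per_day = STEPS_PER_DAY / FRAMES_PER_GAME

print(round(steps_per_second))   # about 289 steps per second
print(round(games_per_day))      # about 8333 full games per day
```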

Football Benchmarks

Similar to the Atari games in the Arcade Learning Environment, in these tasks, the agent has to interact with a fixed environment and maximize its episodic reward by sequentially choosing suitable actions based on observations of the environment.

Algorithms

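The goal in the Football Benchmarks is to win a full game against the opponent bot provided by the Engine. As with the opponent difficulty, the benchmarks come in three levels: easy, medium, and hard. The paper uses three widely used algorithms to cover different research settings: PPO represents multi-process training on a single machine, IMPALA uses a distributed setup with 500 actors, and Ape-X DQN is evaluated as well. I plan to study these algorithms over the next few days if I have time.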

IMPALA

Original paper: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

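The algorithm decouples learning from acting: instead of sending back gradients of the current policy, each worker sends experience trajectories to a central learner. To handle the resulting off-policy data, IMPALA introduces V-trace, an actor-critic update method. The paper uses 500 actors and the Adam optimizer, and trains for 500M steps.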
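The V-trace correction can be sketched as a backward recursion over one trajectory. This is a simplified version under stated assumptions: a single trajectory with no episode termination inside it, importance ratios pi/mu supplied by the caller, and placeholder default values for gamma, rho_bar, and c_bar. It is not the configuration used in the paper.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Compute V-trace value targets v_s for one trajectory.

    rewards[t], values[t] = V(x_t) and ratios[t] = pi(a_t|x_t) / mu(a_t|x_t)
    are per-step lists; bootstrap_value is V(x_n) for the state after the
    last step. Assumes no terminal state inside the trajectory.
    """
    n = len(rewards)
    next_values = values[1:] + [bootstrap_value]
    targets = [0.0] * n
    correction = 0.0  # holds v_{t+1} - V(x_{t+1}) while iterating backwards
    for t in reversed(range(n)):
        rho = min(rho_bar, ratios[t])  # truncated weight for the TD error
        c = min(c_bar, ratios[t])      # truncated weight for the trace
        delta = rho * (rewards[t] + gamma * next_values[t] - values[t])
        correction = delta + gamma * c * correction
        targets[t] = values[t] + correction
    return targets


# On-policy data (all ratios equal to 1) reduces to n-step returns.
targets = vtrace_targets(
    rewards=[1.0, 0.0, 2.0],
    values=[0.5, 0.2, -0.3],
    bootstrap_value=1.0,
    ratios=[1.0, 1.0, 1.0],
    gamma=0.5,
)
```

When the behavior policy mu differs from the learner's policy pi, the truncated ratios shrink the contribution of transitions that the current policy would be unlikely to generate.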

PPO

Original paper: Proximal Policy Optimization Algorithms

PPO is an online policy gradient algorithm that optimizes a clipped surrogate objective. The experiments use the OpenAI Baselines implementation with 16 parallel workers, together with a CNN.
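The clipped surrogate objective is short enough to write out directly. The sketch below works on plain Python floats; the clipping range epsilon=0.2 is a common default used here as an assumption, not a value reported in this post.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratios, advantages, epsilon=0.2):
    """Average the clipped surrogate over a batch (this is maximized)."""
    terms = [clipped_surrogate(r, a, epsilon)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# A large ratio with a positive advantage is capped at (1 + eps) * A.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

Taking the minimum means the clip only removes the incentive to move the ratio further in the direction that already improves the objective; updates that make things worse are not clipped.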

Ape-X DQN

Original paper: Distributed Prioritized Experience Replay

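Ape-X DQN is a highly scalable version of DQN. Like IMPALA, it decouples learning from acting, but it uses a distributed replay buffer and a Q-learning variant that combines dueling network architectures with double Q-learning. Many hyperparameters are set to the same values as for IMPALA to make the comparison fairer.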
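The two Q-learning ingredients named above can be sketched on plain lists. This is a one-step illustration only: the function names are hypothetical, and the distributed replay buffer, prioritization, and any multi-step returns are left out.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean(A(s, .))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def double_q_target(reward, done, gamma, next_q_online, next_q_target):
    """Double Q-learning target for one transition.

    The online network picks the greedy next action; the target network
    evaluates it. Both arguments are lists of Q-values for the next state.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)),
                      key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


q = dueling_q_values(state_value=1.0, advantages=[1.0, -1.0, 0.0])
y = double_q_target(reward=1.0, done=False, gamma=0.5,
                    next_q_online=[0.1, 0.9], next_q_target=[4.0, 2.0])
```

Note that the target in the example uses the target network's value for action 1 (2.0), not its own maximum (4.0); separating selection from evaluation is what reduces overestimation.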

Reward

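The paper proposes two reward settings, SCORING and CHECKPOINT. SCORING gives +1 for scoring a goal and -1 for conceding one. CHECKPOINT was introduced to address reward sparsity. The opponent's half is divided into 10 regions, where being closer to the opponent's goal is more favorable, and a player who carries the ball into a region receives a reward of +0.1.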

First time our player steps into one region with the ball, the reward coming from that region and all previously unvisited further ones will be collected. In total, the extra reward can be up to +1, the same as scoring a goal. To avoid penalizing an agent that would not go through all the checkpoints before scoring, any non-collected checkpoint reward is added to the scoring reward. Checkpoint rewards are only given once per episode.

The paper notes that under most representations this reward is non-Markovian. The CHECKPOINT reward encodes our own domain knowledge: the closer to the goal, the easier it is to score.
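The CHECKPOINT rules quoted above can be sketched as a small bookkeeping class. The class name, the equal-width regions, and the normalized distance to the opponent's goal (max_distance=1.0) are assumptions made for this sketch; the Engine's exact region geometry is not described here.

```python
class CheckpointReward:
    """Tracks which of the checkpoint regions have already paid out."""

    def __init__(self, num_regions=10, bonus=0.1, max_distance=1.0):
        self.num_regions = num_regions
        self.bonus = bonus
        self.max_distance = max_distance  # assumed depth of opponent's half
        self.collected = 0  # checkpoints already rewarded this episode

    def regions_reached(self, distance_to_goal):
        """Number of regions at or behind the player's current position."""
        if distance_to_goal >= self.max_distance:
            return 0  # not in the opponent's half yet
        fraction = (self.max_distance - distance_to_goal) / self.max_distance
        return min(self.num_regions, int(fraction * self.num_regions) + 1)

    def on_step(self, distance_to_goal, has_ball):
        """Reward for entering new regions with the ball (once per episode)."""
        if not has_ball:
            return 0.0
        reached = self.regions_reached(distance_to_goal)
        newly_collected = max(0, reached - self.collected)
        self.collected += newly_collected
        return newly_collected * self.bonus

    def on_goal(self):
        """Scoring reward plus any checkpoint reward not yet collected."""
        remaining = self.num_regions - self.collected
        self.collected = self.num_regions
        return 1.0 + remaining * self.bonus


tracker = CheckpointReward()
r1 = tracker.on_step(distance_to_goal=0.95, has_ball=True)   # first region
r2 = tracker.on_step(distance_to_goal=0.55, has_ball=True)   # skips ahead
r3 = tracker.on_step(distance_to_goal=0.75, has_ball=True)   # already paid
r4 = tracker.on_goal()                                       # goal + rest
```

The collected counter is state that lives outside the observation, which is one way to see why the reward is non-Markovian under most representations.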