Materials
Motivation
This paper introduces the concept of Embodied Visual Exploration: how might a robot equipped with a camera scope out a new environment? Three questions need to be answered about visual exploration:
- What does it mean for an agent to explore its environment well?
- Which methods work well, and under which assumptions and environmental settings?
- Where do current approaches fall short, and where might future work seek to improve?
The paper first frames computer vision problems in three tiers. In the first, models passively learn from data collected and annotated by humans. In the second, embodied active perception, agents learn task-specific controls. In the third, embodied visual exploration, the goal is inherently more open-ended and task-agnostic: how does an agent learn to move around in an environment to gather information that will be useful for a variety of tasks that it may have to perform in the future?
Embodied visual exploration algorithms are hard to compare head-to-head, because different works emphasize different goals and therefore choose different evaluation metrics: overcoming sparse rewards, pixelwise reconstruction of environments, area covered in the environment, object interactions, and information gathering for downstream tasks such as navigation, recognition, and pose estimation. This work therefore proposes a unified view of exploration algorithms for visually rich 3D environments, and a common evaluation framework to understand their strengths and weaknesses.
Problem Setting
Embodied visual exploration is defined as follows: the agent runs an observe-action-update loop in the environment for a fixed number of steps, choosing actions so as to maximize information gain. This is essentially a Partially Observable Markov Decision Process (POMDP).
POMDP
A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $(S,A,T,R,\Omega ,O,\gamma )$:
- $S$ is a set of states,
- $A$ is a set of actions,
- $T$ is a set of conditional transition probabilities between states,
- $R:S\times A\to \mathbb {R} $ is the reward function,
- $\Omega$ is a set of observations,
- $O$ is a set of conditional observation probabilities, and
- $\gamma \in [0,1]$ is the discount factor.
The only difference between a POMDP and an MDP is that in a POMDP the agent receives observations rather than the true state. With the world in state $s\in S$, the agent takes action $a\in A$, the world transitions to state $s^\prime$, and the agent receives an observation drawn from the distribution $O(o\mid s^\prime ,a)$.
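As a concrete (and entirely toy) illustration of that difference, the sketch below implements one POMDP step: the agent commands an action, the hidden state transitions, and only a noisy observation comes back. All states, actions, and probabilities here are invented for illustration.

```python
import random

# Toy POMDP sketch: the agent never sees the true next state s',
# only an observation o ~ O(o | s', a) from a noisy sensor.
S = ["room_a", "room_b"]   # states
A = ["stay", "move"]       # actions

def transition(s, a):
    """T(s' | s, a): 'move' deterministically swaps rooms, 'stay' does nothing."""
    if a == "move":
        return "room_b" if s == "room_a" else "room_a"
    return s

def observe(s_next, a, noise=0.1):
    """O(o | s', a): the sensor reports the room, but is wrong 10% of the time."""
    if random.random() < noise:
        return "room_b" if s_next == "room_a" else "room_a"
    return s_next

def step(s, a):
    s_next = transition(s, a)
    o = observe(s_next, a)
    return s_next, o

random.seed(0)
s_next, o = step("room_a", "move")
print(s_next, o)
```

Note that a policy for this process can only condition on the history of observations `o`, never on `s_next` directly; that is what makes exploration under partial observability hard.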
Exploration Paradigms
Curiosity
In the curiosity paradigm, the agent is encouraged to visit states where its predictive model of the environment is uncertain. The dynamics-based formulation of curiosity learns a forward-dynamics model $\mathcal F$ that predicts a representation of the next state, $\hat{\boldsymbol{s}}_{t+1}=\mathcal{F}\left(\boldsymbol{s}_{t}, a_{t}\right)$, and steers the agent toward transitions the model predicts poorly, i.e. toward next states that differ most from what the model expects. The reward function $R$ is simply the prediction error:

$$R\left(s_{t}, a_{t}\right)=\left\|\mathcal{F}\left(s_{t}, a_{t}\right)-s_{t+1}\right\|_{2}^{2}$$
This forward-dynamics model is trained online, by minimizing $\left\|\mathcal{F}\left(s_{t}, a_{t}\right)-s_{t+1}\right\|_{2}^{2}$ over observed consecutive states.
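A minimal sketch of this loop, with a tiny linear model standing in for the learned $\mathcal F$ and hand-made toy vectors for states and actions (none of this is the paper's actual architecture):

```python
import numpy as np

# Curiosity sketch: reward = prediction error of a forward-dynamics model F.
# F here is a small linear model trained online by gradient descent.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(2, 4))   # F([s; a]) = W @ [s; a], s,a ∈ R^2

def curiosity_reward(s_t, a_t, s_next):
    """R(s_t, a_t) = ||F(s_t, a_t) - s_{t+1}||_2^2."""
    x = np.concatenate([s_t, a_t])
    return float(np.sum((W @ x - s_next) ** 2))

def train_step(s_t, a_t, s_next, lr=0.1):
    """Online update: minimise the same squared error the reward measures."""
    global W
    x = np.concatenate([s_t, a_t])
    err = W @ x - s_next
    W -= lr * np.outer(err, x)

# As the model fits a repeated transition, its curiosity reward decays:
s, a, s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])
before = curiosity_reward(s, a, s_next)
for _ in range(50):
    train_step(s, a, s_next)
after = curiosity_reward(s, a, s_next)
print(before, after)
```

The decay is the point of the paradigm: transitions the model has already mastered stop paying reward, pushing the agent toward parts of the state space it cannot yet predict.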
Novelty
The novelty reward directly records the number of visits to each state and makes the reward negatively correlated with that count. For this bookkeeping, the floor plane of the 3D environment is discretized into a grid; the intuition is that the agent should not visit the same place many times.
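A count-based sketch of this bookkeeping (the $1/\sqrt{n}$ decay and the 0.5 m cell size are common but illustrative choices, not values prescribed by the survey):

```python
import math
from collections import defaultdict

# Novelty sketch: discretise the floor plane into grid cells, keep a visit
# count per cell, and pay a reward that shrinks with the count.
CELL = 0.5                    # grid resolution in metres (illustrative value)
visits = defaultdict(int)

def novelty_reward(x, y):
    cell = (int(x // CELL), int(y // CELL))
    visits[cell] += 1
    return 1.0 / math.sqrt(visits[cell])   # 1/sqrt(n) is a common choice

print(novelty_reward(0.1, 0.1))  # first visit to this cell -> 1.0
print(novelty_reward(0.2, 0.3))  # same cell again -> 1/sqrt(2)
print(novelty_reward(3.0, 3.0))  # a fresh cell -> 1.0
```

Because the reward depends only on position counts, it treats an information-rich doorway and an empty corner identically, which is exactly the criticism the coverage paradigm below makes.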
Coverage
Coverage-based methods regard the novelty criterion as too crude, because in a 3D environment different locations carry different amounts of information, depending on the structure around them. Whereas novelty encourages explicitly visiting all locations, coverage encourages observing all of the environment. In other words, visiting more places is not equivalent to observing more information.
The coverage reward consists of the increment in some observed quantity of interest:

$$R\left(s_{t}, a_{t}\right)=I_{t}-I_{t-1}$$

Here $I_t$ is the amount of interesting things observed by time step $t$. Candidate "things" include area, objects, landmarks, and random views. A random view means that viewpoints in the environment are designated in advance as rewarding; if the agent observes one of these viewpoints, it receives the corresponding reward. This method is similar to the "goal agnostic" baseline.
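The bookkeeping can be sketched as follows, with area coverage as the quantity of interest; the set of cells visible at each step is passed in explicitly here, standing in for whatever the agent's sensors actually reveal:

```python
# Coverage sketch: reward the *increment* in observed area, R_t = I_t - I_{t-1},
# where I_t is the number of distinct grid cells seen so far.
seen = set()

def coverage_reward(visible_cells):
    global seen
    before = len(seen)           # I_{t-1}
    seen |= set(visible_cells)   # fold in this step's observation
    return len(seen) - before    # I_t - I_{t-1}

print(coverage_reward([(0, 0), (0, 1), (1, 1)]))  # 3 new cells -> reward 3
print(coverage_reward([(0, 1), (1, 1), (1, 2)]))  # only (1, 2) is new -> 1
print(coverage_reward([(0, 0)]))                  # nothing new -> 0
```

Unlike the novelty count, re-observing familiar cells pays nothing, so the agent is pushed toward vantage points that reveal the most unseen environment per step.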
Reconstruction
Reconstruction-based methods use the objective of active observation completion to learn exploration policies. The reconstruction reward scores the quality of the predicted outputs, e.g. as the negative reconstruction error over a set of query poses:

$$R_{t}=-\sum_{\mathcal{P}} d\left(\hat{V}_{t}(\mathcal{P}), V(\mathcal{P})\right)$$

Here $V(\mathcal P)$ is the true query view of the camera at pose $\mathcal P$, $\hat V_t(\mathcal P)$ is the agent's reconstruction of that view at time step $t$, and $d$ is a distance function defined over views. Whereas curiosity rewards views that are individually surprising, reconstruction rewards views that bolster the agent's correct hallucination of all other views.
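A toy sketch with 2×2 arrays as "views" and mean squared error as the distance $d$ (both stand-ins for the learned view-synthesis module and distance used by actual methods):

```python
import numpy as np

# Reconstruction sketch: score the agent's predicted views against the true
# views at a set of query poses, R_t = -sum_P d(V_hat_t(P), V(P)).
def reconstruction_reward(pred_views, true_views):
    return -sum(
        float(np.mean((pred_views[p] - true_views[p]) ** 2))  # d = MSE
        for p in true_views
    )

true_views = {"pose_0": np.zeros((2, 2)), "pose_1": np.ones((2, 2))}
perfect    = {"pose_0": np.zeros((2, 2)), "pose_1": np.ones((2, 2))}
blurry     = {"pose_0": np.full((2, 2), 0.5), "pose_1": np.full((2, 2), 0.5)}

print(reconstruction_reward(perfect, true_views))  # 0.0: exact reconstruction
print(reconstruction_reward(blurry, true_views))   # -0.5: penalised by error
```

The reward is highest (least negative) when the views gathered so far let the agent infer the whole scene, so a single surprising view only helps if it sharpens the predictions at many query poses.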
Evaluation Framework
The first kind of evaluation metric is simple: count how many "interesting things" the model visits during exploration, where, as described above, interesting things can be area, objects, and landmarks.
The second kind of evaluation metric measures how well exploration transfers to downstream tasks. Three downstream tasks have recently been widely used for this purpose:
- PointNav: how to quickly navigate from point A to point B?
- View localization: where was this photo taken?
- Reconstruction: what can I expect to see at point B?