N1H111SM's Miniverse

2020/02/28


# Motivation

• What does it mean for an agent to explore its environment well?
• Which methods work well, and under which assumptions and environmental settings?
• Where do current approaches fall short, and where might future work seek to improve?

Embodied visual exploration algorithms are difficult to compare head-to-head, because different works emphasize different goals and therefore evaluate with different metrics: overcoming sparse rewards, pixelwise reconstruction of environments, area covered in the environment, object interactions, and information gathering for downstream tasks such as navigation, recognition, and pose estimation. This work therefore proposes a unified view of exploration algorithms for visually rich 3D environments, along with a common evaluation framework to understand their strengths and weaknesses.

# Problem Setting

## POMDP

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $(S,A,T,R,\Omega ,O,\gamma )$:

• $S$ is a set of states,
• $A$ is a set of actions,
• $T$ is a set of conditional transition probabilities between states,
• $R:S\times A\to \mathbb {R}$ is the reward function,
• $\Omega$ is a set of observations,
• $O$ is a set of conditional observation probabilities, and
• $\gamma \in [0,1]$ is the discount factor.

The only difference between a POMDP and an MDP is the observation the agent receives. Given the current world state $s\in S$, the agent takes an action $a\in A$ and the world transitions to state $s^\prime$; the agent then receives an observation drawn from the distribution $O(o\mid s^\prime ,a)$.
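The dynamics above can be sketched as a minimal tabular POMDP. The two-state world, action names, and observation names below are hypothetical illustrations, not from the paper:

```python
import random

# Minimal tabular POMDP sketch (hypothetical two-state example): the agent
# takes action a in state s, the world transitions to s' via T, emits reward
# R(s, a), and the agent receives an observation o ~ O(o | s', a).

S = ["s0", "s1"]           # states
A = ["stay", "move"]       # actions
Omega = ["dark", "light"]  # observations

# T[s][a] -> list of (s', prob)
T = {
    "s0": {"stay": [("s0", 1.0)], "move": [("s1", 1.0)]},
    "s1": {"stay": [("s1", 1.0)], "move": [("s0", 1.0)]},
}
# R[s][a] -> scalar reward
R = {"s0": {"stay": 0.0, "move": 1.0}, "s1": {"stay": 0.0, "move": 1.0}}
# O[s'][a] -> list of (o, prob): the observation depends on the *next* state
O = {
    "s0": {"stay": [("dark", 1.0)], "move": [("dark", 1.0)]},
    "s1": {"stay": [("light", 1.0)], "move": [("light", 1.0)]},
}

def sample(dist, rng):
    """Draw one outcome from a list of (outcome, prob) pairs."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist:
        acc += p
        if r < acc:
            return outcome
    return dist[-1][0]

def step(s, a, rng):
    """One POMDP transition: returns (next state, reward, observation)."""
    s_next = sample(T[s][a], rng)
    reward = R[s][a]
    obs = sample(O[s_next][a], rng)  # o ~ O(o | s', a)
    return s_next, reward, obs

rng = random.Random(0)
print(step("s0", "move", rng))  # -> ('s1', 1.0, 'light')
```

The key point the sketch makes concrete: the policy never sees `s_next` directly, only `obs` and `reward`.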

## Curiosity

In the curiosity paradigm, the agent is encouraged to visit states where its predictive model of the environment is uncertain. The dynamics-based formulation of curiosity learns a forward-dynamics model $\mathcal F$ that predicts the representation of the next state, $\hat{\boldsymbol{s}}_{t+1}=\mathcal{F}\left(\boldsymbol{s}_{t}, a_{t}\right)$, and drives the agent toward transitions the model predicts poorly by defining the reward as the prediction error:

$$R\left(s_{t}, a_{t}\right)=\left\|\hat{\boldsymbol{s}}_{t+1}-\boldsymbol{s}_{t+1}\right\|_{2}^{2}$$
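A minimal sketch of this reward, assuming (for illustration only) a linear forward-dynamics model trained by gradient descent on squared error; real implementations use learned neural encoders and dynamics networks:

```python
import numpy as np

class ForwardModel:
    """Linear forward-dynamics model F(s, a) -> predicted next state."""

    def __init__(self, state_dim, action_dim, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(state_dim, state_dim + action_dim))
        self.lr = lr

    def predict(self, s, a):
        return self.W @ np.concatenate([s, a])

    def update(self, s, a, s_next):
        # One gradient step on || F(s, a) - s_next ||^2.
        x = np.concatenate([s, a])
        err = self.predict(s, a) - s_next
        self.W -= self.lr * np.outer(err, x)

def curiosity_reward(model, s, a, s_next):
    # R(s_t, a_t) = || F(s_t, a_t) - s_{t+1} ||^2: high where the model is wrong.
    return float(np.sum((model.predict(s, a) - s_next) ** 2))
```

As the model learns a transition, the curiosity reward for it decays toward zero, pushing the agent on to transitions it cannot yet predict.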

## Novelty

The novelty reward directly counts how many times each state has been visited and makes the reward inversely related to that count, e.g. $R(s)=1/\sqrt{n(s)}$ where $n(s)$ is the visit count. The ground plane of the 3D environment is discretized into a grid of cells, so the intuition is simply: do not return to the same place many times.
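A sketch of this counting scheme, assuming the $1/\sqrt{n(s)}$ form and a hypothetical 0.5 m grid cell size:

```python
import math
from collections import defaultdict

class NoveltyReward:
    """Visit-count novelty reward over a discretized ground plane."""

    def __init__(self, cell_size=0.5):
        self.cell_size = cell_size
        self.counts = defaultdict(int)  # grid cell -> visit count n(s)

    def __call__(self, x, y):
        # Map the continuous position to a grid cell and count the visit.
        cell = (int(x // self.cell_size), int(y // self.cell_size))
        self.counts[cell] += 1
        return 1.0 / math.sqrt(self.counts[cell])  # R(s) = 1 / sqrt(n(s))
```

Revisiting the same cell yields $1, 1/\sqrt{2}, 1/\sqrt{3}, \dots$, while stepping into any unvisited cell yields the full reward of 1.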

## Coverage

Coverage takes the view that novelty's criterion is too crude: in a 3D environment, different locations carry different amounts of information, depending on the structure around them. Whereas novelty encourages explicitly visiting all locations, coverage encourages observing all of the environment. In other words, visiting more places is not equivalent to observing more information.

The coverage reward consists of the increment in some observed quantity of interest:

$$R_{t}=\mathcal{C}\left(o_{1:t+1}\right)-\mathcal{C}\left(o_{1:t}\right)$$

where $\mathcal{C}$ measures the total quantity observed so far, e.g. area seen, objects seen, or landmarks visited.
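A sketch of this incremental reward, assuming the quantity of interest is the number of grid cells seen so far (object or landmark counts work the same way):

```python
class CoverageReward:
    """Reward the increment in cumulative observed quantity (here: area)."""

    def __init__(self):
        self.seen = set()  # all grid cells observed up to time t

    def __call__(self, visible_cells):
        # R_t = C(o_{1:t+1}) - C(o_{1:t}) = number of newly observed cells.
        new = set(visible_cells) - self.seen
        self.seen |= new
        return len(new)
```

Note the contrast with novelty: the reward depends on what is *visible* from the agent's pose, not merely on which cell the agent is standing in, so a vantage point overlooking a large unseen region earns more than a step into a cramped one.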

## Reconstruction

Reconstruction-based methods use the objective of active observation completion to learn exploration policies: the agent predicts views at unobserved query poses, and the reconstruction reward scores the quality of the predicted outputs, e.g. as the negative distance between the predicted views and the ground-truth views at those poses.
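A minimal sketch of such a scoring function, assuming (as one plausible form, not the paper's exact definition) a negative mean-squared-error reward over a batch of query views represented as feature vectors:

```python
import numpy as np

def reconstruction_reward(predicted_views, true_views):
    """Score active observation completion: higher (closer to 0) = better.

    predicted_views, true_views: (num_queries, view_dim) arrays of view
    features at unobserved query poses.
    """
    mse = np.mean((predicted_views - true_views) ** 2, axis=1)  # per query
    return float(-np.mean(mse))  # negative reconstruction error
```

Exploration is thus rewarded for reaching viewpoints whose observations make the rest of the environment easy to infer, rather than for raw ground covered.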

# Evaluation Framework

• PointNav: how quickly can the agent navigate from point A to point B?