The 37 Implementation Details of the PPO Algorithm (2/3): 9 Atari-Specific Implementation Details

Blog Title: The 37 Implementation Details of Proximal Policy
Optimization
Authors: Huang, Shengyi; Dossa, Rousslan Fernand Julien; Raffin, Antonin; Kanervisto, Anssi; Wang, Weixun
Blog address: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Code repository (GitHub): https://github.com/vwxyzjn/ppo-implementation-details

This article continues from the previous one, "The 37 Implementation Details of the PPO Algorithm (1/3): 13 Core Implementation Details". It covers the 9 implementation details that are specific to Atari-style game environments.

1. The Use of NoopResetEnv 【Environment Preprocessing】

This wrapper samples initial states by taking a random number (between 1 and 30) of no-ops on reset.

The wrapper concept comes from (Mnih et al., 2015) and (Machado et al., 2018), who suggested that NoopResetEnv is a way to inject randomness into the environment.
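
For reference, here is a minimal sketch of such a wrapper in the spirit of the openai/baselines atari_wrappers (illustrative; it assumes the classic Gym API where step returns a 4-tuple, and the wrapper actually used in the repository may differ in details):

import gym
import numpy as np

class NoopResetEnv(gym.Wrapper):
    """On reset, take a random number of no-op (action 0) steps to randomize the initial state."""

    def __init__(self, env, noop_max=30):
        super().__init__(env)
        self.noop_max = noop_max

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        noops = np.random.randint(1, self.noop_max + 1)  # between 1 and 30 no-ops
        for _ in range(noops):
            obs, _, done, _ = self.env.step(0)  # action 0 is NOOP in ALE
            if done:
                obs = self.env.reset(**kwargs)
        return obs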

2. The Use of MaxAndSkipEnv 【Environment Preprocessing】

This wrapper skips 4 frames by default, repeats the agent's last action on the skipped frames, and sums the rewards of the skipped frames. This frame-skipping technique can significantly speed up the algorithm because stepping the environment is computationally cheaper than the agent's forward pass (Mnih et al., 2015).

The wrapper also returns the pixel-wise maximum over the last two frames, which helps handle certain quirks of Atari games (Mnih et al., 2015).

As shown in the quotation below, this wrapper also comes from (Mnih et al., 2015).

More precisely, the agent sees and selects actions on every kth frame instead of every frame, and its last action is repeated on skipped frames. Because running the emulator forward for one step requires much less computation than having the agent select an action, this technique allows the agent to play roughly k times more games without significantly increasing the runtime. We use k = 4 for all games. First, to encode a single frame we take the maximum value for each pixel colour value over the frame being encoded and the previous frame. This was necessary to remove flickering that is present in games where some objects appear only in even frames while other objects appear only in odd frames, an artefact caused by the limited number of sprites the Atari 2600 can display at once.
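
A minimal sketch of this wrapper, again in the spirit of the openai/baselines atari_wrappers (illustrative, classic Gym API):

import gym
import numpy as np

class MaxAndSkipEnv(gym.Wrapper):
    """Repeat the action for `skip` frames, sum the rewards, and return the
    pixel-wise maximum over the last two frames to remove flickering."""

    def __init__(self, env, skip=4):
        super().__init__(env)
        self._skip = skip
        self._obs_buffer = np.zeros((2,) + env.observation_space.shape, dtype=np.uint8)

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for i in range(self._skip):
            obs, reward, done, info = self.env.step(action)
            if i == self._skip - 2:
                self._obs_buffer[0] = obs
            if i == self._skip - 1:
                self._obs_buffer[1] = obs
            total_reward += reward
            if done:
                break
        max_frame = self._obs_buffer.max(axis=0)  # max over the last two frames
        return max_frame, total_reward, done, info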

3. The Use of EpisodicLifeEnv 【Environment Preprocessing】

In games that have a life counter, such as Breakout, this wrapper marks the loss of a life as the end of an episode.

This wrapper also comes from (Mnih et al., 2015), as quoted below.

For games where there is a life counter, the Atari 2600 emulator also sends the number of lives left in the game, which is then used to mark the end of an episode during training.

Interestingly, (Bellemare et al., 2016) suggested that this wrapper might harm the agent's performance, and (Machado et al., 2018) recommended not using it.
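
A minimal sketch of this wrapper (illustrative, in the spirit of the openai/baselines atari_wrappers; note that it only performs a real emulator reset when the game is actually over):

import gym

class EpisodicLifeEnv(gym.Wrapper):
    """Treat losing a life as the end of an episode, but only reset the
    emulator when the game is truly over."""

    def __init__(self, env):
        super().__init__(env)
        self.lives = 0
        self.was_real_done = True

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.was_real_done = done
        lives = self.env.unwrapped.ale.lives()
        if 0 < lives < self.lives:
            done = True  # a life was lost: signal episode end to the agent
        self.lives = lives
        return obs, reward, done, info

    def reset(self, **kwargs):
        if self.was_real_done:
            obs = self.env.reset(**kwargs)
        else:
            obs, _, _, _ = self.env.step(0)  # no-op step to continue from the lost-life state
        self.lives = self.env.unwrapped.ale.lives()
        return obs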

4. The Use of FireResetEnv【Environment Preprocessing】

This wrapper takes the FIRE action on reset for environments that are fixed until firing.

This wrapper is interesting because, as far as we know, there is no literature reference for it. According to anecdotal conversations (openai/baselines#240), neither the DeepMind nor the OpenAI researchers have any idea where this wrapper came from.
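
A minimal sketch of the wrapper (illustrative; the assertion assumes an ALE action set where action 1 is FIRE):

import gym

class FireResetEnv(gym.Wrapper):
    """Press FIRE on reset for games (e.g. Breakout) that stay frozen until FIRE is pressed."""

    def __init__(self, env):
        super().__init__(env)
        assert env.unwrapped.get_action_meanings()[1] == "FIRE"

    def reset(self, **kwargs):
        self.env.reset(**kwargs)
        obs, _, done, _ = self.env.step(1)  # take the FIRE action
        if done:
            obs = self.env.reset(**kwargs)
        return obs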

5. The Use of WarpFrame (Image transformation)【Environment Preprocessing】

This wrapper extracts the Y channel of a 210×160 pixel image and resizes it to 84×84.

As shown in the quotation below, this wrapper is derived from (Mnih et al., 2015).

Second, we then extract the Y channel, also known as luminance, from the RGB frame and rescale it to 84×84.

In our implementation, we use the following wrappers to achieve the same purpose.

env = gym.wrappers.ResizeObservation(env, (84, 84))
env = gym.wrappers.GrayScaleObservation(env)

6. The Use of ClipRewardEnv【Environment Preprocessing】

The wrapper bins the reward into {+1, 0, -1} according to its sign.

As shown in the quote below, this wrapper comes from (Mnih et al., 2015).

As the scale of scores varies greatly from game to game, we clipped all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged. Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games. At the same time, it could affect the performance of our agent since it cannot differentiate between rewards of different magnitude.
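
This wrapper is essentially a one-liner; a minimal sketch:

import gym
import numpy as np

class ClipRewardEnv(gym.RewardWrapper):
    """Bin rewards to {-1, 0, +1} by their sign."""

    def reward(self, reward):
        return np.sign(reward)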

7. The Use of FrameStack【Environment Preprocessing】

This wrapper stacks the m most recent frames so that the agent can infer the speed and direction of moving objects.

As shown in the quote below, this wrapper comes from (Mnih et al., 2015).

The function φ from algorithm 1 described below applies this preprocessing to the m most recent frames and stacks them to produce the input to the Q-function, in which m = 4.
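
In the accompanying code this is handled by Gym's built-in wrapper; the call below is a sketch that assumes the same `env` object as in the WarpFrame snippet above:

# Stack the 4 most recent (grayscale, 84x84) frames, giving observations of shape (4, 84, 84).
env = gym.wrappers.FrameStack(env, 4)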

8. Shared Nature-CNN network for the policy and value functions 【Neural Network】

For Atari games, PPO uses the same convolutional neural network (CNN) as (Mnih et al., 2015), together with the layer-initialization technique mentioned earlier (baselines/a2c/utils.py#L52-L53), to extract features; the extracted features are flattened and passed through a linear layer to produce hidden features. The policy head and value head are then built on top of these shared hidden features, so the policy function and the value function share parameters. Here is the pseudocode:

hidden = Sequential(
    layer_init(Conv2d(4, 32, 8, stride=4)),
    ReLU(),
    layer_init(Conv2d(32, 64, 4, stride=2)),
    ReLU(),
    layer_init(Conv2d(64, 64, 3, stride=1)),
    ReLU(),
    Flatten(),
    layer_init(Linear(64 * 7 * 7, 512)),
    ReLU(),
)
policy = layer_init(Linear(512, envs.single_action_space.n), std=0.01)
value = layer_init(Linear(512, 1), std=1)
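
The layer_init helper used above is the orthogonal-initialization trick covered in part (1/3) of this series; a minimal PyTorch sketch of it, assuming the same defaults as there, is:

import numpy as np
import torch

def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal weight initialization with gain `std`, constant (zero) bias.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer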

This parameter-sharing setup is significantly faster to compute than setting up completely separate networks (shown below).

policy = Sequential(
    layer_init(Conv2d(4, 32, 8, stride=4)),
    ReLU(),
    layer_init(Conv2d(32, 64, 4, stride=2)),
    ReLU(),
    layer_init(Conv2d(64, 64, 3, stride=1)),
    ReLU(),
    Flatten(),
    layer_init(Linear(64 * 7 * 7, 512)),
    ReLU(),
    layer_init(Linear(512, envs.single_action_space.n), std=0.01)
)
value = Sequential(
    layer_init(Conv2d(4, 32, 8, stride=4)),
    ReLU(),
    layer_init(Conv2d(32, 64, 4, stride=2)),
    ReLU(),
    layer_init(Conv2d(64, 64, 3, stride=1)),
    ReLU(),
    Flatten(),
    layer_init(Linear(64 * 7 * 7, 512)),
    ReLU(),
    layer_init(Linear(512, 1), std=1)
)

However, recent research suggests that balancing the competing policy and value objective can be problematic, which is what approaches such as Phasic Policy Gradient attempt to address (Cobbe et al., 2021).

9. Scaling the Images to Range [0, 1] 【Environment Preprocessing】

The input pixel values are in the range [0, 255]; they should be divided by 255 so that they fall in the range [0, 1].

Our experiments found that this scaling is important. Without it, the first policy update causes the Kullback-Leibler divergence to explode, likely due to the way the layers are initialized.
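
A minimal sketch of this scaling (the helper name is illustrative; in practice the division by 255 can simply happen at the start of the agent's forward pass):

import torch

def scale_observation(obs):
    # obs holds uint8 pixel values in [0, 255]; convert to float32 and scale to [0, 1]
    # before feeding it to the network (e.g. the `hidden` module from the pseudocode above).
    return torch.as_tensor(obs, dtype=torch.float32) / 255.0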

To run the experiments, we matched the hyperparameters used in the original implementation as follows.

# https://github.com/openai/baselines/blob/master/baselines/ppo2/defaults.py
def atari():
    return dict(
        nsteps=128, nminibatches=4,
        lam=0.95, gamma=0.99, noptepochs=4, log_interval=1,
        ent_coef=.01,
        lr=lambda f : f * 2.5e-4,
        cliprange=0.1,
    )

Concretely, these hyperparameters are: 128 steps per environment per rollout (nsteps), 4 minibatches (nminibatches), GAE λ = 0.95 (lam), a discount factor γ = 0.99 (gamma), 4 update epochs (noptepochs), an entropy coefficient of 0.01 (ent_coef), a linearly annealed learning rate starting at 2.5e-4 (lr), and a clipping coefficient of 0.1 (cliprange).
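
The lr entry above is a schedule rather than a constant: f is the fraction of training remaining, so the learning rate starts at 2.5e-4 and is linearly annealed towards 0. A minimal sketch of that schedule (variable names such as num_updates are illustrative):

import torch

num_updates = 10000                              # illustrative: total_timesteps // (num_envs * nsteps)
initial_lr = 2.5e-4
params = [torch.zeros(1, requires_grad=True)]    # placeholder parameters
optimizer = torch.optim.Adam(params, lr=initial_lr, eps=1e-5)

for update in range(1, num_updates + 1):
    frac = 1.0 - (update - 1.0) / num_updates    # fraction of training remaining
    optimizer.param_groups[0]["lr"] = frac * initial_lr
    # ... collect a rollout of nsteps=128 per environment and run noptepochs=4 PPO epochs here ...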

Note that the number-of-environments parameter N (i.e. num_envs) is set to the number of CPUs of the machine (common/cmd_util.py#L167), which is a strange choice. We choose to match the N = 8 used in the paper (which describes this parameter as "number of actors, 8").

As shown below, we modified about 40 lines of code in ppo.py to incorporate these 9 details, resulting in a standalone ppo_atari.py (link) containing 339 lines of code. The original blog shows the file diff between ppo.py (left) and ppo_atari.py (right); please check it out there.