Hi there,
I'm opening this issue and assigning myself to it. I recently made a PR to RL.jl to add a RewardNormalizer.
After thinking about it, I don't think it was done in the best way, because the normalizer still has to be "hard coded" into every algorithm that uses it. I think I have figured out a better solution that can be implemented in this repo.
Normalization is a two-phase thing:
- When generating new experience, the online stats are updated. This is done when pushing experience to the trajectory.
- When sampling to update a learner (or for any reason), normalize with the latest stats. This is done when fetching experience from the trajectory.
Put that way, it is clear that normalization is a trajectory concern. My proposal is to create one (or several) trajectory wrappers that add a normalizer field. Roughly, this could work as follows:
- `push!(a_normalized_trajectory[:trace], data)`
  - first updates `a_normalized_trajectory.normalizer`,
  - then does the normal push to `a_normalized_trajectory.trajectory[:trace]`.
- `(a_normalized_trajectory.trajectory.sampler)(a_normalized_trajectory)`
  - first samples with `(sampler)(a_normalized_trajectory.trajectory)`,
  - but normalizes the sampled traces before returning them.
Note that I used `:trace` above. That's because this does not have to be restricted to rewards: state normalization is also very common in RL, and I believe it could work in just the same way.
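The push/sample behaviour described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `RunningStats`, `ToyTrajectory`, `NormalizedTrajectory`, `push_trace!` and `sample_normalized` are hypothetical names invented for this sketch, the toy trajectory is not the actual `Trajectory` type of this repo, and the hand-written Welford accumulator merely stands in for whatever OnlineStats.jl statistic would be used in practice.

```julia
# Hypothetical stand-in for an OnlineStats.jl accumulator (Welford's algorithm).
mutable struct RunningStats
    n::Int
    mean::Float64
    m2::Float64   # sum of squared deviations from the running mean
end
RunningStats() = RunningStats(0, 0.0, 0.0)

# Update the running mean and squared deviations with one observation.
function update!(s::RunningStats, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Sample standard deviation; falls back to 1.0 until two values have been seen.
stddev(s::RunningStats) = s.n > 1 ? sqrt(s.m2 / (s.n - 1)) : 1.0

# Normalize a value with the latest statistics.
normalized(s::RunningStats, x::Real; eps = 1e-8) = (x - s.mean) / (stddev(s) + eps)

# Toy trajectory (assumption): named traces plus a sampler callable.
struct ToyTrajectory{S}
    traces::Dict{Symbol,Vector{Float64}}
    sampler::S
end

# Wrapper that adds a normalizer field: one accumulator per normalized trace.
struct NormalizedTrajectory{T}
    trajectory::T
    normalizer::Dict{Symbol,RunningStats}
end

# Phase 1: update the stats, then do the normal push (raw data is stored).
function push_trace!(nt::NormalizedTrajectory, name::Symbol, x::Real)
    haskey(nt.normalizer, name) && update!(nt.normalizer[name], x)
    push!(nt.trajectory.traces[name], x)
    return nt
end

# Phase 2: sample with the wrapped sampler, then normalize with the latest stats.
function sample_normalized(nt::NormalizedTrajectory)
    batch = nt.trajectory.sampler(nt.trajectory)
    return Dict(name => (haskey(nt.normalizer, name) ?
                         [normalized(nt.normalizer[name], x) for x in xs] : xs)
                for (name, xs) in batch)
end

# Usage: normalize only the :reward trace; :state passes through untouched.
take_all(t) = Dict(name => copy(xs) for (name, xs) in t.traces)
traj = ToyTrajectory(Dict(:reward => Float64[], :state => Float64[]), take_all)
nt = NormalizedTrajectory(traj, Dict(:reward => RunningStats()))
for r in (1.0, 2.0, 3.0, 4.0)
    push_trace!(nt, :reward, r)
    push_trace!(nt, :state, 10r)
end
batch = sample_normalized(nt)
```

Because the normalizer is keyed by trace name, the same wrapper would cover rewards, states, or both. Storing raw values and normalizing only at sampling time means old experience is always rescaled with the most recent statistics.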
Some notes
- We should use OnlineStats.jl instead of a homemade version.
- We must be careful with some samplers. I am mainly thinking of NStepSampler, where the sampled reward is a discounted sum of rewards, so the normalization must be applied to each reward individually. This indicates that normalization should happen at the earliest stage of sampling, not the latest (unlike what I describe above).
- We must think about how to deal with async trajectories. I think nothing must be done on the workers' side.
- Return normalization can also be done much more easily with the new trajectory design.
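To illustrate the NStepSampler note, here is a small sketch showing that normalizing each reward before the discounted sum gives a different result from normalizing the summed value afterwards. The function `nstep_return` and its `μ`/`σ` keyword arguments are hypothetical and not part of any existing sampler; the statistics are passed in as plain numbers to keep the example self-contained.

```julia
# Discounted n-step sum in which every reward is normalized individually
# before being accumulated. With μ = 0 and σ = 1 it is the plain discounted sum.
function nstep_return(rewards::AbstractVector{<:Real}, γ::Real; μ::Real = 0.0, σ::Real = 1.0)
    total = 0.0
    for (k, r) in enumerate(rewards)
        total += γ^(k - 1) * (r - μ) / σ
    end
    return total
end

rewards = [1.0, 2.0, 3.0]
γ = 0.9

# Normalization at the earliest stage: per reward, then discount and sum.
per_reward = nstep_return(rewards, γ; μ = 2.0, σ = 1.0)

# Normalization at the latest stage: discount and sum, then normalize once.
after_sum = (nstep_return(rewards, γ) - 2.0) / 1.0
```

With these numbers the per-reward version evaluates to about -0.19 while the post-sum version gives about 3.23, because the mean shift is applied once instead of once per discounted term.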