huggingface · burtenshaw · Oct 1, 2025
diff --git a/units/en/unit1/deep-rl.mdx b/units/en/unit1/deep-rl.mdx
@@ -1,21 +1,20 @@
-# The “Deep” in Reinforcement Learning [[deep-rl]]
-
-<Tip>
-What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
-</Tip>
-
-Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.
-
-For instance, in the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
-
-You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
-
-In the second approach, **we will use a Neural Network** (to approximate the Q value).
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Value based RL"/>
-<figcaption>Schema inspired by the Q learning notebook by Udacity
-</figcaption>
-</figure>
-
-If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
+# The “Deep” in Reinforcement Learning [[deep-rl]]
+
+> [!TIP]
+> What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
+
+Deep Reinforcement Learning introduces **deep neural networks to solve Reinforcement Learning problems** — hence the name “deep”.
+
+For instance, in the next unit, we’ll learn about two value-based algorithms: Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning.
+
+You’ll see the difference is that, in the first approach, **we use a traditional algorithm** to create a Q table that helps us find what action to take for each state.
+
+In the second approach, **we will use a Neural Network** (to approximate the Q value).
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/deep.jpg" alt="Value based RL"/>
+<figcaption>Schema inspired by the Q learning notebook by Udacity
+</figcaption>
+</figure>
+
+If you are not familiar with Deep Learning you should definitely watch [the FastAI Practical Deep Learning for Coders](https://course.fast.ai) (Free).
diff --git a/units/en/unit1/rl-framework.mdx b/units/en/unit1/rl-framework.mdx
@@ -1,144 +1,143 @@
-# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]
-
-## The RL Process [[the-rl-process]]
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
-<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
-<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
-</figure>
-
-To understand the RL process, let’s imagine an agent learning to play a platform game:
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
-
-- Our Agent receives **state  \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
-- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
-- The environment goes to a **new** **state \\(S_1\\)** — new frame.
-- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
-
-This RL loop outputs a sequence of **state, action, reward and next state.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%">
-
-The agent's goal is to _maximize_ its cumulative reward, **called the expected return.**
-
-## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]
-
-⇒ Why is the goal of the agent to maximize the expected return?
-
-Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).
-
-That’s why in Reinforcement Learning, **to have the best behavior,** we aim to learn to take actions that **maximize the expected cumulative reward.**
-
-
-## Markov Property [[markov-property]]
-
-In papers, you’ll see that the RL process is called a **Markov Decision Process** (MDP).
-
-We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
-
-## Observations/States Space [[obs-space]]
-
-Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.
-
-There is a differentiation to make between *observation* and *state*, however:
-
-- *State s*: is **a complete description of the state of the world** (there is no hidden information). In a fully observed environment.
-
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/chess.jpg" alt="Chess">
-<figcaption>In chess game, we receive a state from the environment since we have access to the whole check board information.</figcaption>
-</figure>
-
-In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
-
-- *Observation o*: is a **partial description of the state.** In a partially observed environment.
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
-<figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption>
-</figure>
-
-In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.
-
-In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
-
-<Tip>
-In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.
-</Tip>
-
-To recap:
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/obs_space_recap.jpg" alt="Obs space recap" width="100%">
-
-
-## Action Space [[action-space]]
-
-The Action space is the set of **all possible actions in an environment.**
-
-The actions can come from a *discrete* or *continuous space*:
-
-- *Discrete space*: the number of possible actions is **finite**.
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
-<figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption>
-
-</figure>
-
-Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.
-
-- *Continuous space*: the number of possible actions is **infinite**.
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/self_driving_car.jpg" alt="Self Driving Car">
-<figcaption>A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°…
-</figcaption>
-</figure>
-
-To recap:
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/action_space.jpg" alt="Action space recap" width="100%">
-
-Taking this information into consideration is crucial because it will **have importance when choosing the RL algorithm in the future.**
-
-## Rewards and the discounting [[rewards]]
-
-The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.**
-
-The cumulative reward at each time step **t** can be written as:
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
-<figcaption>The cumulative reward equals the sum of all rewards in the sequence.
-</figcaption>
-</figure>
-
-Which is equivalent to:
-
-<figure>
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" alt="Rewards">
-<figcaption>The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...
-</figcaption>
-</figure>
-
-However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.
-
-Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat.**
-
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%">
-
-As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).
-
-Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it.
-
-To discount the rewards, we proceed like this:
-
-1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
-- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
-- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short term reward (the nearest cheese).**
-
-2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**
-
-Our discounted expected cumulative reward is:
-<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%">
+# The Reinforcement Learning Framework [[the-reinforcement-learning-framework]]
+
+## The RL Process [[the-rl-process]]
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process.jpg" alt="The RL process" width="100%">
+<figcaption>The RL Process: a loop of state, action, reward and next state</figcaption>
+<figcaption>Source: <a href="http://incompleteideas.net/book/RLbook2020.pdf">Reinforcement Learning: An Introduction, Richard Sutton and Andrew G. Barto</a></figcaption>
+</figure>
+
+To understand the RL process, let’s imagine an agent learning to play a platform game:
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/RL_process_game.jpg" alt="The RL process" width="100%">
+
+- Our Agent receives **state  \\(S_0\\)** from the **Environment** — we receive the first frame of our game (Environment).
+- Based on that **state \\(S_0\\),** the Agent takes **action \\(A_0\\)** — our Agent will move to the right.
+- The environment goes to a **new** **state \\(S_1\\)** — new frame.
+- The environment gives some **reward \\(R_1\\)** to the Agent — we’re not dead *(Positive Reward +1)*.
+
+This RL loop outputs a sequence of **state, action, reward and next state.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/sars.jpg" alt="State, Action, Reward, Next State" width="100%">
+
+The agent's goal is to _maximize_ its cumulative reward, **called the expected return.**
+
+## The reward hypothesis: the central idea of Reinforcement Learning [[reward-hypothesis]]
+
+⇒ Why is the goal of the agent to maximize the expected return?
+
+Because RL is based on the **reward hypothesis**, which is that all goals can be described as the **maximization of the expected return** (expected cumulative reward).
+
+That’s why in Reinforcement Learning, **to have the best behavior,** we aim to learn to take actions that **maximize the expected cumulative reward.**
+
+
+## Markov Property [[markov-property]]
+
+In papers, you’ll see that the RL process is called a **Markov Decision Process** (MDP).
+
+We’ll talk again about the Markov Property in the following units. But if you need to remember something today about it, it's this: the Markov Property implies that our agent needs **only the current state to decide** what action to take and **not the history of all the states and actions** they took before.
+
+## Observations/States Space [[obs-space]]
+
+Observations/States are the **information our agent gets from the environment.** In the case of a video game, it can be a frame (a screenshot). In the case of the trading agent, it can be the value of a certain stock, etc.
+
+There is a differentiation to make between *observation* and *state*, however:
+
+- *State s*: is **a complete description of the state of the world** (there is no hidden information). In a fully observed environment.
+
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/chess.jpg" alt="Chess">
+<figcaption>In chess game, we receive a state from the environment since we have access to the whole check board information.</figcaption>
+</figure>
+
+In a chess game, we have access to the whole board information, so we receive a state from the environment. In other words, the environment is fully observed.
+
+- *Observation o*: is a **partial description of the state.** In a partially observed environment.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
+<figcaption>In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.</figcaption>
+</figure>
+
+In Super Mario Bros, we only see the part of the level close to the player, so we receive an observation.
+
+In Super Mario Bros, we are in a partially observed environment. We receive an observation **since we only see a part of the level.**
+
+> [!TIP]
+> In this course, we use the term "state" to denote both state and observation, but we will make the distinction in implementations.
+
+To recap:
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/obs_space_recap.jpg" alt="Obs space recap" width="100%">
+
+
+## Action Space [[action-space]]
+
+The Action space is the set of **all possible actions in an environment.**
+
+The actions can come from a *discrete* or *continuous space*:
+
+- *Discrete space*: the number of possible actions is **finite**.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/mario.jpg" alt="Mario">
+<figcaption>In Super Mario Bros, we have only 4 possible actions: left, right, up (jumping) and down (crouching).</figcaption>
+
+</figure>
+
+Again, in Super Mario Bros, we have a finite set of actions since we have only 4 directions.
+
+- *Continuous space*: the number of possible actions is **infinite**.
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/self_driving_car.jpg" alt="Self Driving Car">
+<figcaption>A Self Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21,1°, 21,2°, honk, turn right 20°…
+</figcaption>
+</figure>
+
+To recap:
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/action_space.jpg" alt="Action space recap" width="100%">
+
+Taking this information into consideration is crucial because it will **have importance when choosing the RL algorithm in the future.**
+
+## Rewards and the discounting [[rewards]]
+
+The reward is fundamental in RL because it’s **the only feedback** for the agent. Thanks to it, our agent knows **if the action taken was good or not.**
+
+The cumulative reward at each time step **t** can be written as:
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_1.jpg" alt="Rewards">
+<figcaption>The cumulative reward equals the sum of all rewards in the sequence.
+</figcaption>
+</figure>
+
+Which is equivalent to:
+
+<figure>
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_2.jpg" alt="Rewards">
+<figcaption>The cumulative reward = rt+1 (rt+k+1 = rt+0+1 = rt+1)+ rt+2 (rt+k+1 = rt+1+1 = rt+2) + ...
+</figcaption>
+</figure>
+
+However, in reality, **we can’t just add them like that.** The rewards that come sooner (at the beginning of the game) **are more likely to happen** since they are more predictable than the long-term future reward.
+
+Let’s say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). The mouse's goal is **to eat the maximum amount of cheese before being eaten by the cat.**
+
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_3.jpg" alt="Rewards" width="100%">
+
+As we can see in the diagram, **it’s more probable to eat the cheese near us than the cheese close to the cat** (the closer we are to the cat, the more dangerous it is).
+
+Consequently, **the reward near the cat, even if it is bigger (more cheese), will be more discounted** since we’re not really sure we’ll be able to eat it.
+
+To discount the rewards, we proceed like this:
+
+1. We define a discount rate called gamma. **It must be between 0 and 1.** Most of the time between **0.95 and 0.99**.
+- The larger the gamma, the smaller the discount. This means our agent **cares more about the long-term reward.**
+- On the other hand, the smaller the gamma, the bigger the discount. This means our **agent cares more about the short term reward (the nearest cheese).**
+
+2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, **so the future reward is less and less likely to happen.**
+
+Our discounted expected cumulative reward is:
+<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/rewards_4.jpg" alt="Rewards" width="100%">