Deep Reinforcement Learning for Agents: Huggy and Doom
Awesome Examples of Pre-Generative AI Agents
If you’re not sure what AI agents are, this post is for you.
AI agents have been around since long before LLMs and ChatGPT.
What is an AI agent?
an autonomous entity, operating without human intervention,
that gets feedback from its environment
to make decisions and achieve goals
Examples:
Voice Assistants: Siri, Alexa, Google Assistant.
Gaming: StarCraft and Dota bots.
Chatbots: Customer service bots that resolve support issues.
These agents existed long before ChatGPT and LLMs, although they were not as flexible or adaptable.
In this post, I share awesome examples of pre-generative AI agents, explaining how they’re trained and how they work.
Deep Reinforcement Learning
A popular method for training agents is deep reinforcement learning.
It enables agents to learn and make decisions in complex environments through trial and error, learning directly from interactions with their environment, guided by maximizing expected reward.
As the name suggests, deep reinforcement learning combines:
Reinforcement Learning
Type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize rewards. The agent receives feedback in the form of rewards or penalties, helping it to learn optimal behaviors over time.
For example, giving your dog a treat for good behavior teaches your dog to do more of that good behavior.
Deep Learning
Subset of machine learning using neural networks with many layers (i.e. deep neural networks). In deep reinforcement learning, these networks are used to approximate value functions or policies.
Key Components
Here are the key components of deep reinforcement learning:
Agent: The learner or decision-maker.
Environment: The external system with which the agent interacts.
State: The current situation of the agent.
Action: All possible moves the agent can take in a given state.
Reward: Feedback from the environment based on the action taken.
Policy: The strategy used to determine actions based on states.
Value Function: A measure of how desirable the current state is, in terms of expected future reward.
How It Works
Here’s how the components all come together to train an AI agent:
Exploration vs. Exploitation: the agent explores its environment to gather information (exploration) and uses that information to make decisions that maximize rewards (exploitation).
Learning Process: through repeated interactions, the agent learns from rewards and penalties, adjusting its policy to improve performance.
Neural Networks: used to approximate the policy or value function, as in the sketch below.
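To make this concrete, here’s a minimal sketch of the loop in Python, using PyTorch and Gymnasium’s CartPole environment as a stand-in. The network update step is omitted, and all names here are illustrative:

```python
import random
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")            # stand-in environment
q_net = torch.nn.Sequential(             # deep network approximating state-action values
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2)
)
epsilon = 0.1                            # fraction of steps spent exploring

obs, _ = env.reset()
for step in range(1000):
    state = torch.as_tensor(obs, dtype=torch.float32)
    if random.random() < epsilon:
        action = env.action_space.sample()    # exploration: random action
    else:
        action = int(q_net(state).argmax())   # exploitation: greedy action
    obs, reward, terminated, truncated, _ = env.step(action)
    # A real agent would store (state, action, reward, next state)
    # and update q_net here to improve its policy over time.
    if terminated or truncated:
        obs, _ = env.reset()
env.close()
```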
Huggy
Huggy is an adorable project developed by Hugging Face, based on Puppo the Corgi by the Unity ML-Agents team.
The environment is built with the Unity game engine and the ML-Agents toolkit, which lets you create custom environments for training agents.
In this case, Huggy learns to play fetch!
Objective
The primary goal in this environment is to train Huggy to fetch a stick.
To accomplish this, Huggy must move correctly towards the stick based on the information provided to him about his environment.
State Space
In reinforcement learning, the state space defines what the agent perceives.
Huggy can’t see his surroundings!
He only gets specific information:
Position of the target (stick)
Relative position between himself and the target
Orientation of his legs
Then, Huggy uses his policy to determine the next best actions.
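As a toy illustration, here’s what assembling such an observation vector could look like in Python. The shapes and values are made up; the real layout is defined by the Unity environment:

```python
import numpy as np

# Made-up values and shapes -- the actual observation layout
# is defined inside the Unity ML-Agents environment.
stick_position    = np.array([3.2, 0.0, 5.1])        # where the stick landed
huggy_position    = np.array([0.0, 0.0, 0.0])
relative_position = stick_position - huggy_position  # direction from Huggy to the stick
leg_orientations  = np.zeros(12)                     # joint angles of his legs

observation = np.concatenate([stick_position, relative_position, leg_orientations])
```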
Action Space
The action space is the set of all possible moves Huggy can take.
Huggy's movements are controlled by joint motors that drive his legs.
The action space consists of these joint motor rotations: Huggy learns how much to rotate each leg’s joint motor to move towards the stick.
Reward Function
The reward function reinforces desirable behaviors and penalizes undesirable behaviors.
Here are the components of the reward function:
Orientation bonus: Huggy is rewarded for moving towards the stick.
Time penalty: A fixed penalty at every step, pushing Huggy to reach the stick quickly.
Rotation penalty: Huggy is penalized for spinning and turning too quickly.
Reaching-the-target reward: A large reward when Huggy reaches the stick.
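As a toy illustration, these components could combine into a per-step reward like this (all coefficients are made up, not Huggy’s actual numbers):

```python
# Made-up coefficients -- a toy illustration, not Huggy's actual reward.
reached_stick = False           # True on the step Huggy reaches the stick

orientation_bonus = 0.01        # moving and facing towards the stick
time_penalty      = 0.005       # fixed cost per step: fetch quickly
rotation_penalty  = 0.002       # discourages excessive spinning
target_reward     = 1.0 if reached_stick else 0.0

reward = orientation_bonus - time_penalty - rotation_penalty + target_reward
```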
Training Huggy
To train Huggy, we teach him to run efficiently towards the stick.
At each time step, Huggy must:
Observe the environment
Decide how to rotate each joint motor, without spinning
The training environment runs multiple copies in parallel, each with a stick that spawns at a random position.
When Huggy reaches the stick, the stick respawns elsewhere, providing diverse experiences and speeding up training.
Here’s Huggy learning to play fetch!
Doom
Here’s another example of AI agents trained via deep reinforcement learning, not exactly adorable but still very cool:
Teaching an AI agent to survive Doom without any prior knowledge.
The agent only knows:
life is desirable
death is undesirable
It must learn how to stay alive, recognizing that health is required for survival.
Eventually, the agent learns to collect health packs in order to survive.
This is built with VizDoom, an open-source Python library for training AI agents to play Doom using only visual information.
VizDoom enables training directly from screen pixels.
In this example, our Doom AI agent plays the Health Gathering level.
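Here’s a minimal sketch of a VizDoom episode loop with a random placeholder policy, assuming a recent VizDoom install that ships its bundled scenario configs:

```python
import os
import random
import vizdoom as vzd

game = vzd.DoomGame()
# Health Gathering scenario config bundled with VizDoom
game.load_config(os.path.join(vzd.scenarios_path, "health_gathering.cfg"))
game.set_window_visible(False)
game.init()

game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()
    pixels = state.screen_buffer     # raw screen pixels: the agent's only input
    # Placeholder policy: pick TURN_LEFT / TURN_RIGHT / MOVE_FORWARD at random.
    action = random.choice([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    reward = game.make_action(action)

print("Total reward:", game.get_total_reward())
game.close()
```

A trained agent would replace the random choice with a policy network fed by those raw pixels.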
Proximal Policy Optimization (PPO)
This agent is trained using a technique called Proximal Policy Optimization (PPO).
Remember, policy refers to the strategy an AI agent employs to determine its next actions.
In traditional policy optimization, large updates to the policy can destabilize training, leading to poor performance.
PPO ensures policy updates are gradual and controlled.
It prevents drastic changes that could derail learning, keeping all updates within a safe range (known as “clipping”).
This added stability leads to more reliable learning and better performance in complex environments.
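Here’s a minimal PyTorch sketch of PPO’s clipped surrogate loss (function and variable names are illustrative):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Probability ratio between the updated policy and the old policy.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    # Clipping keeps each update within the safe range [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) bound, negated because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
```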
Environment Setup
Our AI agent’s objective is to learn how to survive.
But at the start, it doesn’t know what will help it survive.
Over time, the AI agent must learn that health is required for survival and medical kits (aka “medkits”) replenish health.
Here’s the environment:
Map: A rectangular space enclosed by walls with a hazardous green, acidic floor that periodically damages the agent.
Medkits: Initially scattered uniformly across the map, with additional medkits appearing intermittently. These medkits restore portions of the agent’s health, essential for survival.
Episode End: The simulation ends when the agent dies or after a timeout period.
Configuration Details
Reward: 1 point for each step the agent stays alive, incentivizing survival.
Penalty: 100 points for dying, teaching the agent to avoid the actions that led to its demise.
Action Space:
Turn left
Turn right
Move forward
Game Variable: Health, which the agent learns is connected to living.
The agent must navigate this environment, utilizing medkits to mitigate health loss from the acidic floor, making strategic decisions to prolong survival.
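For reference, here’s how these settings could be expressed with VizDoom’s Python API (a sketch; the timeout value is illustrative):

```python
import vizdoom as vzd

game = vzd.DoomGame()
game.set_living_reward(1)        # +1 for each step the agent stays alive
game.set_death_penalty(100)      # applied as -100 when the agent dies
game.add_available_button(vzd.Button.TURN_LEFT)
game.add_available_button(vzd.Button.TURN_RIGHT)
game.add_available_button(vzd.Button.MOVE_FORWARD)
game.add_available_game_variable(vzd.GameVariable.HEALTH)
game.set_episode_timeout(2100)   # illustrative timeout, in game tics
```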
Here’s the Doom AI agent learning to survive!
Last Thoughts
If you enjoyed this post, I would love to hear from you!
Just hit reply and let me know if you’d like to see more posts about AI agents, including generative AI agents and multi-agent systems!