Q-Learning Maze Runner

TriWei AI Lab


Train an agent with epsilon-greedy exploration and watch value estimates shape an emergent policy.

Q-value updates · Policy heatmaps
Gridworld layout with policy arrows and rewards.
How to play + what to look for
  • Goal: learn a policy by trial and error using Q‑learning with ε‑greedy exploration.
  • Step = one action update. Episode = run until terminal or max steps.
  • Watch arrows converge as Q values stabilize.
  • Keyboard: S=Step, E=Episode, R=Reset.

Learning objectives

  • Concept focus: understand how Q‑learning iteratively updates state–action values using reward signals and discounting.
  • Core definition: the Q‑update \(Q(s,a) \leftarrow (1-\alpha)Q(s,a) + \alpha[r + \gamma\max_{a'}Q(s',a')]\) blends old estimates with new sampled returns.
  • Common mistake: a learning rate that is too large makes Q values oscillate or overshoot, while a discount factor very close to 1 inflates value magnitudes and slows convergence.
  • Why it matters: Q‑learning underpins many reinforcement learning algorithms and illustrates how agents can learn optimal policies without a model.
  • Toy disclaimer: this 8×8 gridworld is a small environment; real RL problems use larger state spaces and function approximators.
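The update rule above can be sketched in a few lines of tabular code. This is a minimal illustration, not the lab's actual implementation; the state/action sizes and hyperparameter values here are assumptions chosen to match the 8×8 gridworld described in this page.

```python
import numpy as np

# Hypothetical toy setup mirroring the lab's 8x8 gridworld (names are ours).
N_STATES, N_ACTIONS = 8 * 8, 4   # 4 actions: up / right / down / left
ALPHA, GAMMA = 0.1, 0.9          # learning rate and discount factor (example values)

Q = np.zeros((N_STATES, N_ACTIONS))

def q_update(s, a, r, s_next, done):
    """One tabular Q-learning update: blend the old estimate with the
    sampled one-step return r + gamma * max_a' Q(s', a')."""
    # Terminal states have no future value, so the target is just the reward.
    target = r if done else r + GAMMA * Q[s_next].max()
    Q[s, a] = (1 - ALPHA) * Q[s, a] + ALPHA * target
```

Note how a single update only moves Q(s,a) a fraction α toward the sampled target; the arrows in the visualization converge as many such small corrections accumulate.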

Watch a tiny agent learn to reach a goal in a gridworld using ε‑greedy Q‑learning. This is a classic RL toy problem—excellent for intuition, not a substitute for real-world training.

Gridworld

Live counters: Episode, Steps, and Return (all reset to 0 at the start of a run).

What you are seeing

  • Each state has 4 Q-values (up/right/down/left). Arrows show the greedy action.
  • ε controls random exploration; α controls how fast Q updates; γ controls future reward weight.
  • A per-step cost encourages shorter paths; traps end the episode with a negative reward.
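The ε mentioned above governs the explore/exploit trade-off. A minimal sketch of ε‑greedy action selection (our own illustration, with random tie-breaking so an all-zero initial Q table doesn't bias the agent toward action 0):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

def epsilon_greedy(q_row, epsilon):
    """Pick an action from one state's Q-values.
    With probability epsilon, explore uniformly at random;
    otherwise act greedily, breaking ties among maximal actions at random."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    best = np.flatnonzero(q_row == q_row.max())  # indices of all maximal actions
    return int(rng.choice(best))
```

With ε = 0 the agent is purely greedy and can get stuck on an early suboptimal path; with ε = 1 it ignores its Q-values entirely. Watching the lab with different ε settings makes this trade-off visible.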

Q-learning update reference: Sutton & Barto, Reinforcement Learning: An Introduction (2nd ed.).

Collaboration Credits

These interactive labs are the result of a close collaboration between a human author and an AI assistant (ChatGPT). The AI contributed algorithmic refinements, numerical safeguards, and visual improvements, while the human designed the pedagogical structure, reviewed all code, and ensured educational accuracy. Mathematical formulas and derivations are referenced to reputable course notes and textbooks. All code runs entirely in the browser; no data is sent to any server.