Introduction to Reinforcement Learning
Motivational Intro to RL⚓︎
In behavioral psychology, researchers have long asked how a human is able to learn and grasp knowledge: a newborn child has no information about the physical world, how different entities interact with each other, or how natural phenomena happen. As the child grows, it learns from everything happening around it in order to adapt to its environment. RL draws inspiration from this view in behavioral psychology, focusing on how agents (such as humans) learn optimal behaviors through interactions with their environment by receiving rewards and punishments.
Pavlov's Dog Experiment and the Conditioned Response
Pavlov's dog experiment, a classic demonstration of classical conditioning, involved training dogs to salivate at the sound of a bell by repeatedly pairing the bell with food. Initially, the bell (a neutral stimulus) would not elicit salivation, but after being paired with food (an unconditioned stimulus), the dogs eventually salivated at the bell alone, which had become a conditioned stimulus producing a conditioned response.
What observations do we see from this experiment?
- Unconditioned Stimulus : Food, which naturally triggers salivation in dogs.
- Unconditioned Response : Salivation in response to the food.
- Neutral Stimulus : A bell, which initially has no effect on salivation.
- Conditioned Stimulus : The bell, after being repeatedly paired with the food, becomes a conditioned stimulus.
- Conditioned Response : Salivation in response to the bell alone.
Hence, Pavlov's experiment demonstrated how a neutral stimulus can acquire the ability to elicit a learned or conditioned response through association with an unconditioned stimulus. It's analogous to learning through rewards and stimulus associations, and this fundamental concept has had a profound impact on psychology and our understanding of learning.
In RL, we have an agent that takes actions inside an environment; the agent's actions affect the state of the environment, and the agent receives a reward for each good or bad action after observing the resulting state. Sometimes the agent gets an immediate reward after acting, but in many real-world scenarios the reward arrives only after some delay, which makes learning more challenging.
Delayed Rewards ? In many real-world scenarios, rewards do not follow actions immediately but only after some delay, making learning more challenging.
Noisy and Stochastic Environments ? Environments in which the outcomes of actions are uncertain or subject to noise, requiring robust learning methods.
A case study of Backgammon and Neural Networks
Backgammon is a two-player board game played with dice and checkers on a board divided into 24 points. The objective is to be the first to move all your checkers around the board and off. Players alternate turns rolling dice and moving their checkers according to the numbers rolled. The game involves strategy in blocking and hitting the opponent's checkers, with the ultimate goal of bearing off all checkers before the opponent. It has a long history, with origins dating back thousands of years to games like "nard" in Persia. It's a game of both luck and skill, offering strategic depth and exciting gameplay.
What are the key concepts of backgammon ?
- Board : It is divided into 4 quadrants, each with 6 points, totaling 24 points.
- Checkers : Each player has 15 checkers.
- Dice : Players roll two dice each turn to determine the number of points to move their checkers.
- Movement : Checkers move forward (counterclockwise for one player, clockwise for the other).
- Hitting : Landing on a single opponent's checker (a blot) sends it to the bar.
- Bearing Off : Once all checkers are in a player's home board, they can start bearing them off the board.
- Doubling Cube : A doubling cube is used to increase the stakes of the game.
How to play backgammon ?
- Setup: Players place their checkers on specific points as defined by the rules.
- Rolling: Players roll dice to determine who goes first and the numbers to move checkers.
- Moving: Checkers move according to the numbers rolled, following the rules of movement and hitting.
- Entering: Checkers on the bar must be entered back into the opponent's home board before other moves can be made.
- Bearing Off: Players bear off checkers once all are in their home board.
- Winning: The first player to bear off all their checkers wins the game.
Neural Backgammon ? An early attempt to apply neural networks to play Backgammon, using supervised learning on expert moves.
TD-Gammon ?
A breakthrough RL program developed by Gerald Tesauro that learned to play Backgammon at a grandmaster level by playing against itself. It combines Temporal Difference (TD) learning with neural networks.
- Uses a multilayer neural network as a value function approximator.
- Employs the \(TD(λ)\) algorithm to update weights based on differences between successive predictions, enabling learning from delayed rewards.
- Self-play approach: one network version is frozen while the other learns, facilitating gradual improvement of the opponent.
- Online Learning : TD-Gammon learns without prior explicit feedback, purely from the outcomes of its own plays.
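For reference, a rough sketch of the \(TD(\lambda)\) weight update behind this approach (standard notation; the symbols \(\alpha\), \(\gamma\), \(\lambda\) are not spelled out in the notes above) is:

$$ \delta_t = r_{t+1} + \gamma\, V(s_{t+1}; w) - V(s_t; w), \qquad e_t = \gamma \lambda\, e_{t-1} + \nabla_w V(s_t; w), \qquad w \leftarrow w + \alpha\, \delta_t\, e_t $$

where \(\delta_t\) is the temporal-difference error, \(e_t\) is the eligibility trace that spreads credit back over earlier positions, \(\alpha\) is the learning rate, and \(\gamma\) the discount factor. In TD-Gammon the only nonzero reward is the final game outcome, so the prediction differences between successive positions drive all intermediate updates.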
Self-Play ? A concept where an agent learns by playing against itself, gradually improving both players' strategies.
Online learning ? Learning that occurs continuously through interactions with the environment without relying on a pre-existing dataset or prior feedback.
Go is a two-player abstract strategy board game that originated in China more than 2,500 years ago. The game is played on a grid board with black and white stones, and the objective is to control more territory than the opponent by surrounding empty spaces. Go is popular in East Asian countries like China (where it's called Weiqi), Korea (Baduk), and Japan (Igo). It has gained a following worldwide, especially after World War II.
Basics of the Go game ?
- Board : Go is played on a grid board, typically with 19x19 lines, creating 361 intersections.
- Stones : Players take turns placing their stones (black or white) on these intersections.
- Objective : The goal is to enclose more territory (empty intersections) than the opponent.
- Captures : Stones can be captured if they are completely surrounded by the opponent's stones.
- No movement : Once placed, stones remain on the board unless captured; they don't move.
- Pass: A player can choose to pass their turn if they have no advantageous moves.
- Game End: The game ends when both players pass consecutively, or when there are no more meaningful moves to make.
- Scoring : Points are awarded for captured stones and enclosed territory.
- Liberties: A liberty is an empty adjacent intersection to a stone or group of stones.
- Groups: Stones of the same color connected horizontally or vertically form a group.
- Territory: Enclosed empty spaces surrounded by a player's stones constitute their territory.
- Ko: A specific rule preventing infinite repetition of captures.
Applications of RL go beyond games, to areas such as computational advertising, where agents optimize ad placements and bidding strategies based on feedback signals. TD-Gammon remains a canonical example demonstrating the power of reinforcement learning combined with neural networks to solve complex problems involving delayed rewards and large decision spaces in stochastic environments, through self-play and online learning.
RL Framework and Applications⚓︎
This section covers the core RL framework, how it works, its challenges, and how TD learning and related concepts address the unique nature of sequential decision-making under uncertainty.
Background and Context
- RL differs from Supervised Learning (SL) primarily because it handles sequences of decisions rather than independent input-output pairs.
- Unlike SL, RL does not have a direct target output. It only gets a reward signal (e.g., 3), but does not know if 3 is good or bad compared to other possible outcomes.
- The goal is to maximize long-term cumulative rewards, learning to do better than just achieving the current reward.
Core Ideas of RL
- Trial and Error Learning: Agents experiment with actions to find which yield better rewards.
- Prediction and Outcome Estimation: RL focuses on predicting future rewards (or probability of winning) conditional on current actions.
- Prediction accuracy improves over time; the agent uses knowledge at time \(t+1\) to update decisions at time \(t\).
- This concept parallels how animals learn using Temporal Difference (TD) Learning, adapting behavior based on outcomes of past actions.
Example: Tic Tac Toe
- The environment is a game with episodes, where agents learn by playing multiple games.
- The reward is usually 1 for a win and 0 otherwise.
- The value function estimates which moves lead to states with a higher probability of winning.
- Waiting until the end of an episode to evaluate is inefficient; TD learning allows online updates after each state transition, as in the sketch below.
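As a minimal illustration (not from the original notes), such an online value update for tic-tac-toe might look like the following sketch. Board states are assumed to be encoded as 9-character strings, the default value 0.5, the step size, and the exploration rate are all arbitrary choices:

```python
import random

# Value table: state (a 9-character string of 'X', 'O', ' ') -> estimated win probability.
# Unseen states default to 0.5; terminal states would be seeded with 1.0 (win) or 0.0 elsewhere.
values = {}
alpha = 0.1  # step size

def V(state):
    return values.get(state, 0.5)

def td_update(state, next_state):
    """Online TD-style update: move V(state) toward V(next_state)
    without waiting for the end of the game."""
    values[state] = V(state) + alpha * (V(next_state) - V(state))

def choose_move(state, afterstates, epsilon=0.1):
    """Pick the reachable next board (afterstate) with the highest estimated value,
    exploring with a small probability."""
    if random.random() < epsilon:
        return random.choice(afterstates)
    return max(afterstates, key=V)
```

Here `afterstates` is assumed to be the list of boards reachable from `state` in one move; playing many games and calling `td_update` after every transition gradually improves the win-probability estimates.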
Reinforcement Learning Components
Component | Description |
---|---|
Agent | The learner or decision-maker that takes actions. |
Environment | The world with which the agent interacts, providing states and rewards based on actions. |
State | Representation of the environment at a time step. |
Action | Choices available to the agent at each state. |
Reward | Scalar feedback signal indicating the immediate value of an action. |
Policy | The strategy mapping states to actions. |
Value Function | Estimates expected cumulative rewards starting from a given state (or state-action pair). |
Model | (Optional) Predicts next states and rewards, useful in model-based RL for planning. |
Trade-offs in RL Learning
- Exploration vs. Exploitation:
    - Exploration means trying new actions to discover their values.
    - Exploitation means choosing the best-known action to maximize reward.
- Deciding when to explore or exploit is a core challenge.
- Agents should try every action at least once to gather information, especially in stochastic environments.
Update Mechanisms
- Delayed Reward Challenge: Rewards might come long after the actions that caused them, requiring the agent to assign credit properly.
- TD Learning: Updates value estimates based on immediate next-state prediction errors without waiting for the episode to end.
- This leads to faster and more efficient learning.
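For reference, the standard TD(0) update that implements this idea (the step size \(\alpha\) and discount factor \(\gamma\) are standard notation, not defined in the notes above) is:

$$ V(s_t) \leftarrow V(s_t) + \alpha \big( r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \big) $$

where the term in parentheses is the prediction error between the current estimate \(V(s_t)\) and the one-step bootstrapped target \(r_{t+1} + \gamma\, V(s_{t+1})\).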
Special Cases and Real-World Notes
- Some environments have repeated states reachable through different action sequences.
- In complex domains like Atari games, immediate rewards may not be explicit; RL agents may infer rewards (for example, from game scores).
- Not all games or tasks are fully solved by RL; some (like Seaquest in Atari) remain challenging.
Practical Applications
- Classic board games (Backgammon, Go) using self-play to improve without prior knowledge.
- Complex systems like computational advertising use RL to optimize real-time decisions under uncertainty.
- Robotics, autonomous control, finance, and more use RL's sequential decision-making ability.
Intro to Immediate RL⚓︎
This section lays the foundation for Immediate RL in stochastic, stateless environments with rewards sampled from unknown but stationary distributions, focusing on immediate rewards and the exploration-exploitation dilemma. It forms the basis for understanding bandit algorithms and the wider sequential decision-making frameworks.
Definition and Setting
- Immediate RL deals with scenarios where the agent receives immediate rewards after taking actions, without state transitions or delayed consequences.
- The environment is considered stateless, or in the same state every time, meaning the outcome depends only on the action taken, not on any previous state or history.
Stochastic Outcomes and Reward Distributions
- Outcomes are stochastic, i.e., rewards from the environment are drawn from a probability distribution.
- The agent often needs to perform multiple trials to estimate the statistical properties (mean, variance) of rewards from each action.
Exploration vs. Exploitation Dilemma
- The agent must balance:
    - Exploration: Trying different actions to learn about their reward distributions.
    - Exploitation: Selecting the action currently believed to yield the highest reward.
- This balance is critical since greedily picking the best-known action can prevent discovering better options.
Concept of Reward
- Reward (from psychology), Payoff (from economics), Evaluation (from optimization), or Cost (from control theory) are synonymous notions describing feedback signals guiding decisions.
- Rewards can be sampled from probabilistic distributions, for example:
    - Bernoulli Distribution: Think of a coin toss with rewards 1 (head) or 0 (tail).
    - Each action corresponds to a different "coin" with an unknown probability of landing heads.
- The agent receives rewards by "sampling" from these distributions.
Sampling From Probability Distributions
- Example for Bernoulli:
    - The reward is 1 with probability \(p\) and 0 with probability \(1-p\).
- For Gaussian or continuous rewards:
    - The \(x\)-axis corresponds to different reward values.
    - The \(y\)-axis corresponds to the probability (density) of each reward.
- Sampling procedure:
    - Generate a uniform random number \(u\) in \([0, 1]\).
    - Use the Cumulative Distribution Function (CDF) of the distribution to find the corresponding reward value (i.e., invert the CDF at \(u\)).
- This models randomness in observed rewards.
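A small sketch of this sampling procedure (the parameters are made up; for the Bernoulli case the inverse-CDF step reduces to thresholding \(u\)):

```python
import random
from statistics import NormalDist

def sample_bernoulli(p):
    """Bernoulli(p) reward via the inverse-CDF idea: threshold a uniform number."""
    u = random.random()              # uniform number in [0, 1)
    return 1 if u < p else 0

def sample_gaussian(mean, std):
    """Gaussian reward via inverse-CDF (quantile) sampling."""
    u = random.random()
    while u == 0.0:                  # inv_cdf needs u strictly inside (0, 1)
        u = random.random()
    return NormalDist(mean, std).inv_cdf(u)

# Example "arms" with made-up parameters.
print(sample_bernoulli(0.7))         # 0 or 1
print(sample_gaussian(1.0, 0.5))     # a continuous reward value
```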
Stationary Reward Distributions
- The reward distribution for each action remains stationary, i.e., unchanged over time.
- The agent’s task is to estimate these distributions' parameters to maximize expected reward.
Objective: Maximizing Expected Payoff
- The agent aims to pick the arm (action) with the highest expected payoff (mean reward).
- However, the true distribution generating rewards is unknown, so the agent must estimate it from observations.
Connection to Multi-Armed Bandit Problem
- Immediate RL closely mirrors the classic multi-armed bandit problem:
    - Multiple actions ("arms"), each with an unknown reward distribution.
    - Need to balance exploration and exploitation to maximize cumulative rewards.
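To make the setting concrete, here is a minimal sketch of a stateless Bernoulli bandit (the class name and parameters are illustrative, not from the notes); each pull simply returns a sampled reward, with no state transition:

```python
import random

class BernoulliBandit:
    """A stateless multi-armed bandit: arm i pays 1 with (unknown) probability probs[i]."""
    def __init__(self, probs):
        self.probs = probs            # hidden from the learning agent
        self.n_arms = len(probs)

    def pull(self, arm):
        """Sample an immediate reward for the chosen arm."""
        return 1 if random.random() < self.probs[arm] else 0

# Example: three arms; the agent must discover that arm 2 has the highest expected payoff.
bandit = BernoulliBandit([0.2, 0.5, 0.8])
print(bandit.pull(2))
```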
Bandit Optimalities⚓︎
This section covers the fundamental notions of optimality in bandits, explaining how different solution perspectives, core concepts, and problem types fit into the problem of learning under uncertainty through trial and error.
Introduction to Bandit Problems
- The term bandit comes from the analogy of a gambler facing multiple slot machines ("arms") that can "steal" money.
- Bandit problems involve making decisions to maximize cumulative reward when each choice (arm) yields stochastically distributed payoffs.
Single-Arm vs Multi-Arm Bandits
- Single-Arm Bandit: Only one arm is available, so there is no choice to be made; this reduces to simple statistical inference.
- Multi-Armed Bandit (MAB): Multiple arms (actions), each with an unknown reward distribution. The challenge is selecting arms over time to maximize total payoff.
Solution Concepts for Bandits
3.1 Asymptotic Correctness
- Definition: Guarantees that in the long run (as the number of trials grows large), the algorithm will select the arm with the highest expected payoff.
- Perspective: Focuses on correctness eventually but does not care about performance during learning.
- Ensures convergence to the optimal arm, but is not time-efficient.
3.2 Regret Optimality
- Definition: Regret measures the loss caused by not always playing the optimal arm. The formal regret over \(T\) trials is:
$$ \text{Regret}(T) = T\mu^* - \sum_{t=1}^T r_t $$
where \(\mu^*\) is the expected reward of the best arm and \(r_t\) is the reward at trial \(t\).
- Goal: Minimize regret; strive for the regret to grow as slowly as possible with respect to \(T\).
- Lower Bound: No algorithm can achieve regret growing slower than \(\log T\).
- Tradeoff: Limiting exploration can reduce regret in the short term but risks missing knowledge of the optimal arm, impacting asymptotic optimality.
3.3 PAC (Probably Approximately Correct) Optimality
- Definition: With high probability \(1-\delta\), the algorithm identifies an arm whose reward is within \(\epsilon\) of the best arm's reward.
- Formally:
$$ P\left(q^*(a) \geq q^*(a^*) - \epsilon\right) \geq 1 - \delta $$
where \(q^*(a)\) is the expected payoff of arm \(a\) and \(a^*\) is the optimal arm.
- Focus: On sample complexity, i.e., how many samples are needed to guarantee this level of confidence.
- Guarantee: The algorithm returns an arm approximately close to the best one with high probability.
- Exact identification of the best arm can require unboundedly many samples, so PAC research focuses on minimizing the sample complexity of the approximate guarantee.
Summary Table of Bandit Solution Perspectives
Solution Concept | Goal | Focus | Performance Measure |
---|---|---|---|
Asymptotic Correctness | Eventually select optimal arm | Long-term correctness | Convergence to optimal arm |
Regret Optimality | Minimize cumulative regret during learning | Speed of convergence | Regret growth rate (lower bound: \(\log T\)) |
PAC Optimality | Return near-optimal arm with high probability | Sample complexity & confidence | Probability bounds on solution quality |
Additional Notes
- Regret Optimality balances exploration and exploitation to minimize losses during learning.
- Asymptotic Correctness ensures the learning foundation but might not provide good performance early on.
- PAC Guarantees offer confidence-based approximations useful when exact optimality is impractical.
- Real-world algorithms like Epsilon-Greedy, UCB, and Thompson Sampling aim to negotiate these trade-offs effectively.
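As one concrete example of how these trade-offs are negotiated, here is a rough sketch of the UCB1 selection rule (the function and variable names are my own; the bonus term is the standard \(\sqrt{2\ln t / N(a)}\) form), which adds an exploration bonus that shrinks as an arm is sampled more often:

```python
import math

def ucb1_select(counts, values, t):
    """Pick the arm maximizing estimated value + exploration bonus.
    counts[a]: number of pulls of arm a so far
    values[a]: current mean-reward estimate of arm a
    t: total number of pulls so far (t >= 1)."""
    for a, n in enumerate(counts):
        if n == 0:                      # try every arm at least once
            return a
    return max(
        range(len(counts)),
        key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )
```

Arms that look good are exploited, while rarely tried arms keep a large bonus and are eventually re-explored, which is what keeps the regret growth logarithmic for this family of algorithms.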
Sample Complexity Perspective & Value Function-Based Methods⚓︎
This section covers the sample complexity perspective and value function-based methods in reinforcement learning and bandit problems.
Sample Complexity Perspective
- Sample complexity addresses the number of samples or trials needed before the algorithm can confidently identify or approximate the best action.
- In multi-armed bandit (MAB) problems, this means how many pulls of each arm are required to distinguish the optimal arm within a given confidence and accuracy.
- PAC framework (Probably Approximately Correct): Guarantees that the selected action is within \(\epsilon\) of the true best action with at least \(1-\delta\) probability.
- Sample complexity depends largely on:
- The gap between expected rewards of arms (\(\Delta\))
- Confidence level \(\delta\)
- Problem complexity characterized by reward distributions
- Lower and upper bounds on sample complexity guide algorithm design to be efficient in exploration.
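For rough intuition (a standard result from the PAC-bandit literature, not derived in these notes), the naive strategy that samples all \(n\) arms equally and returns the empirically best one needs on the order of

$$ O\!\left( \frac{n}{\epsilon^2} \log \frac{n}{\delta} \right) $$

total samples to be \((\epsilon, \delta)\)-correct; more refined algorithms such as median elimination improve the logarithmic factor, and the gaps \(\Delta\) between arms determine problem-dependent bounds.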
Value Function Based Methods
- The value function \(V\) or action-value function \(Q\) estimates the expected cumulative reward for each action.
- \(Q^*\) represents the true value, which is unknown.
- The estimate at time \(t\) is denoted \(Q_t\) and is updated from data.
2.1 Greedy and Epsilon-Greedy Policies
- Greedy: Always selects the action with the highest estimated value.
- Epsilon-Greedy:
- With probability \(1-\epsilon\), pick the currently best estimated arm (exploit).
- With probability \(\epsilon\), pick a random arm (explore).
- Probability of choosing the best arm at time \(t\): $$ 1 - \epsilon + \frac{\epsilon}{n} $$
- Probability of choosing each other arm: $$ \frac{\epsilon}{n} $$
- Gradually decrease \(\epsilon\) over time so the algorithm exploits more while still guaranteeing exploration.
- Too fast decay risks missing the true best arm (loss of correctness guarantee).
- Practical approach: keep \(\epsilon\) high initially (for better exploration), then slowly reduce it.
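A minimal epsilon-greedy sketch with a slowly decaying \(\epsilon\); the decay schedule, arm probabilities, and helper names are illustrative choices, not from the notes:

```python
import random

def epsilon_greedy(pull, n_arms, steps=1000, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Epsilon-greedy with a decaying epsilon.
    pull(a) -> stochastic reward of arm a; the eps_start/decay schedule is illustrative."""
    counts = [0] * n_arms
    values = [0.0] * n_arms           # running mean-reward estimates Q_t(a)
    eps = eps_start
    for _ in range(steps):
        if random.random() < eps:
            a = random.randrange(n_arms)                      # explore
        else:
            a = max(range(n_arms), key=lambda i: values[i])   # exploit
        r = pull(a)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]              # incremental sample average
        eps = max(eps_min, eps * decay)                       # decay slowly, keep some exploration
    return values

# Example with three Bernoulli arms (success probabilities are made up).
probs = [0.2, 0.5, 0.8]
estimates = epsilon_greedy(lambda a: 1 if random.random() < probs[a] else 0, n_arms=3)
print(estimates)   # the estimate for arm 2 should be close to 0.8
```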
2.2 Softmax (Boltzmann) Exploration
- Select actions with probabilities proportional to their estimated values.
- Probability is computed as: $$ P(a) = \frac{e^{Q_t(a)/\beta}}{\sum_{b} e^{Q_t(b)/\beta}} $$
where \(\beta\) is a temperature parameter controlling randomness:
    - High \(\beta\): near-uniform random selection (high exploration)
    - Low \(\beta\): approaches the greedy policy (exploitation)
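A small sketch of softmax action selection under the temperature convention above (the function name and the max-subtraction for numerical stability are my own choices):

```python
import math
import random

def softmax_select(q_values, beta=1.0):
    """Sample an action with probability proportional to exp(Q(a) / beta).
    Large beta -> nearly uniform; small beta -> nearly greedy."""
    m = max(q / beta for q in q_values)                    # subtract max for stability
    prefs = [math.exp(q / beta - m) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

print(softmax_select([0.2, 0.5, 0.8], beta=0.1))  # usually picks arm 2
```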
Updating Estimates: Stochastic Averaging
- Instead of storing all history, maintain running averages: $$ Q_{t+1} = Q_t + \alpha_t \left(R_t - Q_t\right) $$
- Step-size \(\alpha_t\) controls learning rate.
- \(R_t\) is the reward received after action at time \(t\).
- The error term is the difference between observed reward and current estimate.
- This form is called stochastic approximation or stochastic averaging.
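The same update written as a tiny code sketch; choosing \(\alpha_t = 1/N\) (one valid schedule, my choice here) recovers the exact sample average:

```python
def update(q, reward, alpha):
    """Stochastic averaging: move the estimate toward the observed reward.
    q      : current estimate Q_t
    reward : observed reward R_t
    alpha  : step size alpha_t (e.g., 1/N for a sample average, or a constant)."""
    return q + alpha * (reward - q)

# With alpha = 1/N the estimate equals the running mean of the rewards seen so far.
q, n = 0.0, 0
for r in [1, 0, 1, 1, 0]:
    n += 1
    q = update(q, r, 1.0 / n)
print(q)  # 0.6 = (1+0+1+1+0)/5
```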
Stationary vs Non-Stationary Environments
- Stationary environment: Reward distributions do not change over time.
    - Step-size \(\alpha_t\) typically decreases (\(\alpha_t \to 0\)) to ensure convergence.
- Non-stationary environment: Reward distributions change over time (e.g., "worn-out coins").
    - Use a constant step size \(\alpha < 1\) to weigh recent rewards more heavily.
    - Advantage: Allows tracking changes, faster adaptation.
    - Drawback: May cause oscillations around the true value, no strict convergence.
- Alpha decay for stationary environments: Gradually reduce \(\alpha\) to balance adaptability and stability.
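A brief sketch contrasting the two step-size choices on a drifting reward stream (the drift model and the constant step size 0.1 are made-up illustrations):

```python
import random

def track(rewards, constant_alpha=None):
    """Estimate the mean reward online.
    constant_alpha=None -> decaying step size 1/N (suits stationary rewards);
    otherwise a fixed alpha < 1 that keeps weighting recent rewards (suits drift)."""
    q, n = 0.0, 0
    for r in rewards:
        n += 1
        alpha = constant_alpha if constant_alpha is not None else 1.0 / n
        q += alpha * (r - q)
    return q

# Bernoulli arm whose success probability drifts from 0.2 to 0.8 halfway through.
rewards = [1 if random.random() < (0.2 if t < 500 else 0.8) else 0 for t in range(1000)]
print(track(rewards))                      # sample average lags: roughly 0.5
print(track(rewards, constant_alpha=0.1))  # constant alpha tracks the recent value: near 0.8
```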
Summary
Concept | Description |
---|---|
Sample Complexity | Number of trials needed to identify near-optimal arm with high confidence. |
Epsilon-Greedy | Balances exploration and exploitation via the \(\epsilon\) parameter. |
Softmax Exploration | Probability proportional to estimated values; controlled by temperature \(\beta\). |
Stochastic Averaging Update | Incremental update of value estimates with step-size \(\alpha_t\). |
Stationary vs Non-Stationary | Adjust the \(\alpha_t\) step size depending on reward distribution stability. |