Published: 2026-06-01 β€’ Updated: 2026-07-05

Reinforcement Learning (RL) Basics

Interview Preparation Hub for AI/ML Roles

Introduction

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties and aims to maximize cumulative reward over time. RL is inspired by behavioral psychology and has applications in robotics, gaming, finance, and autonomous systems.

Core Concepts

  • Agent: Learner or decision-maker.
  • Environment: The world the agent interacts with.
  • State: Current situation of the environment.
  • Action: Choice made by the agent.
  • Reward: Feedback signal guiding learning.
  • Policy: Strategy mapping states to actions.
  • Value Function: Expected cumulative reward from a state.

Types of Reinforcement Learning

  • Model-Free RL: Learns directly from experience (Q-learning, SARSA).
  • Model-Based RL: Builds a model of the environment to plan actions.
  • Policy-Based RL: Directly optimizes the policy (Policy Gradient, Actor-Critic).
  • Value-Based RL: Learns value functions to derive policies.

Q-Learning Example

import numpy as np

# Initialize Q-table
Q = np.zeros((state_space, action_space))

# Parameters
alpha = 0.1   # learning rate
gamma = 0.9   # discount factor
epsilon = 0.1 # exploration rate

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        if np.random.rand() < epsilon:
            action = env.action_space.sample()  # explore
        else:
            action = np.argmax(Q[state])        # exploit
        next_state, reward, done, _ = env.step(action)
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state
        

Deep Reinforcement Learning

Deep RL combines reinforcement learning with deep neural networks. Deep Q-Networks (DQN) approximate Q-values using CNNs, enabling agents to play complex games like Atari. Policy Gradient methods optimize policies directly. Actor-Critic models combine value-based and policy-based approaches for stability.

Real-World Applications

  • Robotics (navigation, manipulation)
  • Gaming (AlphaGo, Atari, Chess)
  • Finance (portfolio optimization, trading)
  • Healthcare (treatment planning, drug discovery)
  • Autonomous Vehicles (decision-making, path planning)

Common Mistakes

  • Not balancing exploration vs exploitation.
  • Using sparse rewards without shaping β†’ slow learning.
  • Ignoring discount factor tuning.
  • Overfitting to simulated environments.
  • Neglecting stability issues in deep RL (catastrophic forgetting).

Interview Notes

  • Be ready to explain difference between supervised, unsupervised, and reinforcement learning.
  • Discuss Q-learning vs SARSA.
  • Explain exploration-exploitation trade-off.
  • Know how DQN works and why experience replay is used.
  • Understand policy gradient methods and Actor-Critic models.

Extended Deep Dive

Reinforcement Learning is formalized using Markov Decision Processes (MDPs), defined by states, actions, transition probabilities, and rewards. The agent’s goal is to maximize expected cumulative reward, often discounted over time.

Exploration vs Exploitation: Agents must balance trying new actions (exploration) with leveraging known rewarding actions (exploitation). Techniques like epsilon-greedy and Upper Confidence Bound (UCB) are used.

Policy Gradient: Instead of learning value functions, these methods directly optimize the policy using gradient ascent. Actor-Critic models combine both approaches, improving stability and efficiency.

Challenges: Sample inefficiency, reward sparsity, and stability in training deep RL models. Solutions include reward shaping, curriculum learning, and hybrid approaches.

Summary

Reinforcement Learning is a powerful paradigm for sequential decision-making. Understanding agents, environments, rewards, policies, and value functions is essential for interviews. Candidates should be able to explain Q-learning, deep RL, and policy gradient methods, discuss real-world applications, and address challenges like exploration-exploitation and stability.


Deep Dive Section 1: Foundational Mathematics of Markov Decision Processes (MDPs)

To clear technical loops at premier AI research laboratories, a soft intuitive grasp of sequential decision-making is insufficient. Candidates must navigate the mathematical mechanics of Markovian state tracking and expectation bounds under uncertain environments.

The Formal Definition of an MDP Tuple

Any reinforcement learning problem that can be modeled structurally is framed as a Markov Decision Process. An MDP is mathematically defined by a five-element tuple:

$$\mathcal{M} = \left\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \right\rangle$$

  • $\mathcal{S}$ represents the complete structural set of all valid environment states (the state space).
  • $\mathcal{A}$ signifies the comprehensive action space from which the agent can select operations.
  • $\mathcal{P}$ is the state transition probability function, mapping the likelihood of shifting to state $s'$ given a current state $s$ and action $a$:

    $$\mathcal{P}(s' \mid s, a) = \mathbb{P}\left[S_{t+1} = s' \mid S_t = s, A_t = a\right]$$

  • $\mathcal{R}$ is the reward function, returning an expected scalar reward value based on the state-action transition sequence:

    $$\mathcal{R}(s, a, s') = \mathbb{E}\left[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'\right]$$

  • $\gamma \in [0, 1)$ represents the temporal discount factor, ensuring long-term infinite horizon goals converge mathematically and modeling preference for immediate over deferred returns.

The Markov Property Defined

The critical assumption underlying this mathematics is the **Markov Property**. This property states that the future development of an environment depends solely on its current state representation, making its historical trajectory irrelevant:

$$\mathbb{P}\left[S_{t+1} = s_{t+1}, R_{t+1} = r_{t+1} \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \dots, S_0 = s_0, A_0 = a_0\right] = \mathbb{P}\left[S_{t+1} = s_{t+1}, R_{t+1} = r_{t+1} \mid S_t = s_t, A_t = a_t\right]$$

[Image diagram of a Markov Decision Process feedback loop detailing agent action generation and environment state-reward feedback mechanisms]

Deep Dive Section 2: Exact Derivation of the Bellman Optimality Equations

Value-based optimization relies on evaluating how good a state or state-action pair is for the agent over time.

The Long-Term Discounted Return Function

The objective of an RL agent is to discover an optimized strategy, or policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$, that maximizes the expected cumulative discounted return $G_t$ from any time step $t$ onward:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Deriving the Bellman Expectation Equation for State Values

Let $V^\pi(s)$ represent the expected return starting from state $s$ and following policy $\pi$ thereafter. We expand this definition recursively by separating the immediate reward from subsequent downstream returns:

$$V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s \right]$$

$$V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \right]$$

Using the law of total expectation, we expand this relationship into explicit transition sums over actions and next states:

$$V^\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma V^\pi(s') \right]$$

The Bellman Optimality Equations

To find the absolute best policy $\pi^*$, we define the optimal state value function $V^*(s) = \max_\pi V^\pi(s)$ and the optimal action-value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$. Under the optimal policy, the agent greedily selects the action that maximizes the expected return at each step. This gives rise to the non-linear **Bellman Optimality Equations**:

$$V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$$

$$Q^*(s, a) = \sum_{s' \in \mathcal{S}} \mathcal{P}(s' \mid s, a) \left[ \mathcal{R}(s, a, s') + \gamma \max_{a' \in \mathcal{A}} Q^*(s', a') \right]$$

Because the $\max$ operator introduces a non-linearity, these optimality equations cannot be solved directly using linear algebra. Instead, models must use iterative methods like Value Iteration, Policy Iteration, or sample-driven Temporal Difference learning.

Deep Dive Section 3: Model-Free Control Mechanics β€” Tabular Q-Learning vs. SARSA

When an environment's transition probabilities $\mathcal{P}$ and reward structures $\mathcal{R}$ are unknown, the agent must learn model-free strategies directly from experience. Two foundational algorithms for this are Q-Learning and SARSA.

[Image matrix view displaying a tabular Q-table array alongside update arrows matching off-policy target trajectories]

1. Off-Policy Temporal Difference Optimization: Q-Learning

Q-Learning is an **off-policy** temporal difference algorithm. It estimates the optimal action-value function $Q^*$ regardless of the actions selected by the current behavioral policy. The standard tabular update formula is defined as:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right)$$

The off-policy nature comes from the target term $\max_{a} Q(S_{t+1}, a)$, which assumes the agent will act perfectly greedily in the next state, even if the behavioral policy (like $\epsilon$-greedy) actually selects an exploratory or sub-optimal action.

2. On-Policy Temporal Difference Optimization: SARSA

In contrast, SARSA (**State-Action-Reward-State-Action**) is an **on-policy** algorithm. It updates its value estimates based on the actual trajectory executed by the behavioral policy:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right)$$

Because it accounts for the actual next action $A_{t+1}$ chosen by the exploratory policy, SARSA learns value estimates that are sensitive to the risks of exploration. For example, in a dangerous environment like a cliff-world puzzle, SARSA will learn a safer path that avoids edge hazards, whereas Q-Learning will learn the mathematically optimal path along the cliff edge but suffer frequent penalties during exploration.

Deep Dive Section 4: Scaling to Deep Q-Networks (DQN) & Training Stabilizations

Tabular approaches fail when handling complex environments with large or continuous state spaces, such as raw video frames from a video game. Deep Q-Networks (DQN) solve this by using neural networks to approximate action-values: $Q(s, a; \boldsymbol{\theta}) \approx Q^*(s, a)$, where $\boldsymbol{\theta}$ represents the network weights.

[Image workflow of Deep Q-Networks detailing target networks experience buffers and loss minimization streams]

DQN Training Instabilities

Standard deep learning algorithms assume that training samples are independent and identically distributed (i.i.d.). However, in reinforcement learning, consecutive experiences are highly correlated, which can cause deep networks to destabilize or fail to converge. DQN addresses these challenges using two structural techniques:

  • Experience Replay Buffer: The agent saves transition tuples $e_t = (s_t, a_t, r_{t+1}, s_{t+1})$ into a large memory storage pool $\mathcal{D}$. During training, instead of learning from the immediate step, the network samples random, uncorrelated mini-batches from $\mathcal{D}$. This breaks the temporal correlation between consecutive data points and restores the i.i.d. assumption required for stable gradient descent.
  • Target Network Decoupling: If the same network parameters are used to compute both the current prediction and the next-step target estimate, the training target changes constantly with every weight update, leading to feedback loops. DQN resolves this by maintaining two separate models: the online network weights $\boldsymbol{\theta}$ and a dedicated target network $\boldsymbol{\theta}^-$. The target weights remain fixed for a set number of steps, stabilizing the optimization target:

    $$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \boldsymbol{\theta}^-) - Q(s, a; \boldsymbol{\theta}) \right)^2 \right]$$

Deep Dive Section 5: Policy Gradient Theorems & Actor-Critic Paradigms

Value-based methods often struggle when applied to continuous action spaces, such as controlling a robot's precise joint angles, because finding the maximum value across infinite actions becomes computationally expensive. Policy Gradient methods address this by parameterizing the policy directly: $\pi_\theta(a \mid s)$.

The Policy Gradient Theorem Derivation

Let $J(\boldsymbol{\theta})$ represent the expected cumulative reward for a parameterized policy, defined across full trajectory rollouts $\tau = (s_0, a_0, s_1, a_1, \dots)$: $J(\boldsymbol{\theta}) = \mathbb{E}_{\tau \sim \pi_\boldsymbol{\theta}}[R(\tau)]$. To optimize this objective using gradient ascent, we calculate the derivative with respect to the policy parameters $\boldsymbol{\theta}$:

$$\nabla_\boldsymbol{\theta} J( \boldsymbol{\theta} ) = \nabla_\boldsymbol{\theta} \int \mathbb{P}(\tau \mid \boldsymbol{\theta}) R(\tau) d\tau = \int \nabla_\boldsymbol{\theta} \mathbb{P}(\tau \mid \boldsymbol{\theta}) R(\tau) d\tau$$

Using the identity $\nabla_\boldsymbol{\theta} f = f \nabla_\boldsymbol{\theta} \log f$, we convert this integral back into an expectation that can be estimated from sampled experiences:

$$\nabla_\boldsymbol{\theta} J( \boldsymbol{\theta} ) = \int \mathbb{P}(\tau \mid \boldsymbol{\theta}) \nabla_\boldsymbol{\theta} \log \mathbb{P}(\tau \mid \boldsymbol{\theta}) R(\tau) d\tau = \mathbb{E}_{\tau \sim \pi_\boldsymbol{\theta}} \left[ \nabla_\boldsymbol{\theta} \log \mathbb{P}(\tau \mid \boldsymbol{\theta}) R(\tau) \right]$$

Expanding the probability of a trajectory shows that the environment's transition functions $\mathcal{P}$ do not depend on the network parameters $\boldsymbol{\theta}$. This simplifies the final derivative into the standard **Policy Gradient Theorem** formula:

$$\nabla_\boldsymbol{\theta} J(\boldsymbol{\theta}) = \mathbb{E}_{s_t, a_t \sim \pi_\boldsymbol{\theta}} \left[ \nabla_\boldsymbol{\theta} \log \pi_\boldsymbol{\theta}(a_t \mid s_t) Q^{\pi_\boldsymbol{\theta}}(s_t, a_t) \right]$$

The REINFORCE Algorithm

The classic **REINFORCE** algorithm estimates this gradient by replacing the true action-value $Q(s_t, a_t)$ with the raw empirical return $G_t$ collected from full sample rollouts. While unbiased, this approach suffers from high variance because identical actions can yield wildly different returns due to random environmental shifts later in the episode.

The Actor-Critic Framework

The **Actor-Critic** framework addresses this high variance by splitting the model into two complementary parts:

  • The Actor: Manages the parameterized policy $\pi_\boldsymbol{\theta}(a \mid s)$ and updates its weights using gradient ascent to encourage actions that yield positive feedback.
  • The Critic: Approximates the environment's state-value function $V_\boldsymbol{\phi}(s)$ using parameters $\boldsymbol{\phi}$. It evaluates the selected actions by calculating the **Advantage Function** $A(s, a)$, which measures whether an action performed better or worse than the network's baseline expectation:

    $$A(s_t, a_t) = Q(s_t, a_t) - V(s_t) \approx R_{t+1} + \gamma V_\boldsymbol{\phi}(S_{t+1}) - V_\boldsymbol{\phi}(S_t)$$

Using the scalar advantage value as a baseline dramatically reduces variance, enabling faster and more stable training for complex tasks.

[Image architecture of an Actor-Critic system highlighting value baseline adjustments and policy gradient updates]

Deep Dive Section 4: Architectural Comparisons and Algorithmic Tradeoffs

Choosing the right reinforcement learning approach requires balancing sample efficiency, structural stability, and action space requirements. The table below outlines these major trade-offs:

Algorithm Class Supported Action Spaces Sample Efficiency Levels Core Engineering Bottlenecks & Vulnerabilities
Tabular Q-Learning Strictly Discrete Spaces Only High (Reuses history directly) Suffers from the curse of dimensionality; cannot scale to large or continuous state representations.
Deep Q-Networks (DQN) Strictly Discrete Spaces Only Moderate (Stabilized by Replay Memory) Prone to overestimating Q-values; requires target network decoupling to avoid divergent training loops.
REINFORCE Policy Gradient Discrete or Continuous Spaces Very Low (Requires fresh data samples) High gradient variance requires using extremely small learning rates to prevent policy collapse.
Advantage Actor-Critic (A2C) Discrete or Continuous Spaces High (Stabilized by Critic baseline updates) Requires managing dual model architectures simultaneously; sensitive to learning rate balances between actor and critic updates.

Deep Dive Section 5: High-Performance Concurrent Java Q-Learning Computational Engine

While reinforcement learning models are typically researched and trained using Python, deploying high-throughput simulation environments or native game backends often requires low-latency Java implementations. The class below provides a production-ready, thread-safe Java engine that executes parallelized tabular updates, controls exploratory epsilon-greedy schedules, and optimizes value tables across multi-threaded worker pools.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Enterprise multithreaded calculation engine executing concurrent off-policy Q-learning tabular adjustments.
 */
public class HighPerformanceRLComputeEngine {

    private final int availableWorkers;
    private final ExecutorService processingPool;
    
    // Thread-safe tabular storage mapping State IDs to Action-Value arrays
    private final ConcurrentHashMap<Integer, double[]> qTableRepository;
    private final ConcurrentHashMap<Integer, ReentrantReadWriteLock> stateLockRegistry;
    
    private final int totalActionDimensions;
    private final double operationalAlpha; // Learning rate hyperparameter
    private final double computationalGamma; // Temporal discount parameter

    public HighPerformanceRLComputeEngine(int actionDimensions, double alpha, double gamma) {
        this.availableWorkers = Runtime.getRuntime().availableProcessors();
        this.processingPool = Executors.newFixedThreadPool(availableWorkers);
        this.qTableRepository = new ConcurrentHashMap<>();
        this.stateLockRegistry = new ConcurrentHashMap<>();
        this.totalActionDimensions = actionDimensions;
        this.operationalAlpha = alpha;
        this.computationalGamma = gamma;
    }

    /**
     * Confirms state array instantiation under thread isolation boundaries.
     */
    private double[] locateOrInitializeStateRow(int stateId) {
        stateLockRegistry.putIfAbsent(stateId, new ReentrantReadWriteLock());
        return qTableRepository.computeIfAbsent(stateId, k -> new double[totalActionDimensions]);
    }

    /**
     * Resolves the optimal action selection greedily for a given state identity.
     */
    public int retrieveGreedyAction(int stateId) {
        double[] actionValues = locateOrInitializeStateRow(stateId);
        ReentrantReadWriteLock.ReadLock readLock = stateLockRegistry.get(stateId).readLock();
        
        readLock.lock();
        try {
            int optimalActionIndex = 0;
            double peakValue = -Double.MAX_VALUE;
            for (int i = 0; i < actionValues.length; i++) {
                if (actionValues[i] > peakValue) {
                    peakValue = actionValues[i];
                    optimalActionIndex = i;
                }
            }
            return optimalActionIndex;
        } finally {
            readLock.unlock();
        }
    }

    /**
     * Executes an exploration-exploitation action selection step following an epsilon-greedy schedule.
     */
    public int selectExploratoryAction(int stateId, double currentEpsilon) {
        Random randomGenerator = new Random();
        if (randomGenerator.nextDouble() < currentEpsilon) {
            return randomGenerator.nextInt(totalActionDimensions); // Explore random operational trajectory
        }
        return retrieveGreedyAction(stateId); // Exploit known optimal value boundaries
    }

    /**
     * Updates an action-value estimate asynchronously within a thread-safe transaction layout.
     * Implements: Q(s,a) = Q(s,a) + alpha * (reward + gamma * max Q(s',a') - Q(s,a))
     */
    public void submitAsynchronousTransitionUpdate(int state, int action, double reward, int nextState, boolean targetIsTerminal) {
        processingPool.submit(() -> {
            // Step 1: Compute target next-state value baseline using a read lock
            double downstreamMaxTargetQ = 0.0;
            if (!targetIsTerminal) {
                double[] nextStateValues = locateOrInitializeStateRow(nextState);
                ReentrantReadWriteLock.ReadLock nextStateReadLock = stateLockRegistry.get(nextState).readLock();
                
                nextStateReadLock.lock();
                try {
                    for (double val : nextStateValues) {
                        if (val > downstreamMaxTargetQ) {
                            downstreamMaxTargetQ = val;
                        }
                    }
                } finally {
                    nextStateReadLock.unlock();
                }
            }

            // Step 2: Write the updated estimate back to the target state-action index using a write lock
            locateOrInitializeStateRow(state);
            ReentrantReadWriteLock.WriteLock targetStateWriteLock = stateLockRegistry.get(state).writeLock();
            
            targetStateWriteLock.lock();
            try {
                double[] currentValues = qTableRepository.get(state);
                double currentQEstimate = currentValues[action];
                // Apply the tabular Bellman update formula
                currentValues[action] = currentQEstimate + operationalAlpha * (reward + (computationalGamma * downstreamMaxTargetQ) - currentQEstimate);
            } finally {
                targetStateWriteLock.lock(); // Downgrade transaction lock safely
                targetStateWriteLock.unlock();
            }
            return null;
        });
    }

    /**
     * Concurrently runs parallel simulation batches to evaluate training updates across multiple channels.
     */
    public void processBatchExperienceRollouts(List<TransitionRecord> batchList) {
        List<Future<Void>> transactionTasks = new ArrayList<>();
        for (TransitionRecord record : batchList) {
            transactionTasks.add(processingPool.submit(() -> {
                submitAsynchronousTransitionUpdate(record.s, record.a, record.r, record.sPrime, record.isTerminal);
                return null;
            }));
        }
        try {
            for (Future<Void> task : transactionTasks) {
                task.get(); // Synchronize all parallel updates
            }
        } catch (Exception e) {
            throw new RuntimeException("Parallel reinforcement execution thread pool was interrupted", e);
        }
    }

    /**
     * Safely terminates internal execution threads.
     */
    public void terminateEngine() {
        this.processingPool.shutdown();
    }

    /**
     * Data wrapper for passing transition sequences between execution threads.
     */
    public static class TransitionRecord {
        final int s;
        final int a;
        final double r;
        final int sPrime;
        final boolean isTerminal;

        public TransitionRecord(int s, int a, double r, int sPrime, boolean isTerminal) {
            this.s = s;
            this.a = a;
            this.r = r;
            this.sPrime = sPrime;
            this.isTerminal = isTerminal;
        }
    }
}

Conclusion and Next Strategic Steps

Reinforcement Learning provides a powerful framework for training autonomous agents to solve complex sequential decision-making tasks without requiring direct supervision. By using the Bellman equations to update value estimates or applying policy gradient methods to optimize strategies directly, networks can learn complex behaviors through trial-and-error interaction with their environment.

To see how these reward-driven optimization methods can be applied to train large language models, proceed to our next advanced module: Alignment Strategies β€” Proximal Policy Optimization (PPO) and Reinforcement Learning from Human Feedback (RLHF). There, we will analyze the complete optimization algorithms used to align foundational neural models with human intent. Keep coding!

About the Author

Naresh Kumar

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.

LinkedIn Profile