Reinforcement Learning (RL) Basics

Interview Preparation Hub for AI/ML Roles

Introduction

Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties and aims to maximize cumulative reward over time. RL is inspired by behavioral psychology and has applications in robotics, gaming, finance, and autonomous systems.

Core Concepts

  • Agent: Learner or decision-maker.
  • Environment: The world the agent interacts with.
  • State: Current situation of the environment.
  • Action: Choice made by the agent.
  • Reward: Feedback signal guiding learning.
  • Policy: Strategy mapping states to actions.
  • Value Function: Expected cumulative reward from a state.
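
The concepts above fit together in a single interaction loop: the agent observes a state, picks an action from its policy, and the environment returns a reward and the next state. A minimal sketch with a hypothetical two-state toy environment (all names and dynamics are illustrative):

```python
import random

class ToyEnv:
    """Hypothetical 2-state environment: taking action 1 while in
    state 1 ends the episode with reward +1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        done = reward > 0
        self.state = min(self.state + action, 1)  # action 1 advances the state
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
done = False
total_reward = 0.0
while not done:
    action = random.choice([0, 1])          # a random policy: state -> action
    state, reward, done = env.step(action)  # environment returns feedback
    total_reward += reward
print(total_reward)
```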

Types of Reinforcement Learning

  • Model-Free RL: Learns directly from experience (Q-learning, SARSA).
  • Model-Based RL: Builds a model of the environment to plan actions.
  • Policy-Based RL: Directly optimizes the policy (Policy Gradient, Actor-Critic).
  • Value-Based RL: Learns value functions to derive policies.
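
The value-based case is easy to make concrete: once a Q-function is learned, the policy is derived greedily from it. A sketch with an illustrative hand-filled Q-table:

```python
import numpy as np

# Hypothetical learned Q-table: 3 states x 2 actions (values are illustrative)
Q = np.array([[0.2, 0.8],
              [0.5, 0.1],
              [0.0, 0.3]])

# Value-based RL derives the policy from the values rather than learning it directly
greedy_policy = np.argmax(Q, axis=1)   # best action in each state
state_values = np.max(Q, axis=1)       # V(s) = max_a Q(s, a)
print(greedy_policy)  # [1 0 1]
```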

Q-Learning Example

import numpy as np
import gymnasium as gym  # assumes the Gymnasium package is installed

# Any discrete environment works; FrozenLake-v1 is used here as an example
env = gym.make("FrozenLake-v1")
state_space = env.observation_space.n
action_space = env.action_space.n

# Initialize Q-table
Q = np.zeros((state_space, action_space))

# Hyperparameters
alpha = 0.1    # learning rate
gamma = 0.9    # discount factor
epsilon = 0.1  # exploration rate

for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()   # explore
        else:
            action = np.argmax(Q[state])         # exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update (off-policy temporal-difference control)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

Deep Reinforcement Learning

Deep RL combines reinforcement learning with deep neural networks. Deep Q-Networks (DQN) approximate Q-values with a neural network (a convolutional network when the input is raw pixels), enabling agents to play Atari games directly from screen images. Policy Gradient methods optimize policies directly. Actor-Critic models combine value-based and policy-based approaches for stability.
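
A key ingredient of DQN is the experience replay buffer: transitions are stored and sampled uniformly at random, which breaks the temporal correlation between consecutive updates. A minimal sketch (the class name and capacity are illustrative, not from a specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer, as used in DQN-style training."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling decorrelates the training batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=100)
for t in range(50):
    buffer.push(t, 0, 1.0, t + 1, False)
batch = buffer.sample(8)
print(len(batch))  # 8
```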

Real-World Applications

  • Robotics (navigation, manipulation)
  • Gaming (AlphaGo, Atari, Chess)
  • Finance (portfolio optimization, trading)
  • Healthcare (treatment planning, drug discovery)
  • Autonomous Vehicles (decision-making, path planning)

Common Mistakes

  • Not balancing exploration vs exploitation.
  • Using sparse rewards without shaping → slow learning.
  • Ignoring discount factor tuning.
  • Overfitting to simulated environments.
  • Neglecting stability issues in deep RL (catastrophic forgetting).

Interview Notes

  • Be ready to explain the difference between supervised, unsupervised, and reinforcement learning.
  • Discuss Q-learning vs SARSA.
  • Explain exploration-exploitation trade-off.
  • Know how DQN works and why experience replay is used.
  • Understand policy gradient methods and Actor-Critic models.
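
The Q-learning vs SARSA distinction in the notes above comes down to a single term in the update target. A side-by-side sketch with an illustrative Q-table and transition:

```python
import numpy as np

alpha, gamma = 0.1, 0.9
Q = np.array([[0.0, 1.0],      # illustrative Q-table: 2 states x 2 actions
              [0.5, 2.0]])
state, action, reward, next_state = 0, 0, 1.0, 1
next_action = 0  # the action the (epsilon-greedy) policy actually took next

# Q-learning (off-policy): bootstraps from the best next action
q_target = reward + gamma * np.max(Q[next_state])        # uses max_a Q(s', a)
# SARSA (on-policy): bootstraps from the action actually taken
sarsa_target = reward + gamma * Q[next_state, next_action]

q_update = Q[state, action] + alpha * (q_target - Q[state, action])
sarsa_update = Q[state, action] + alpha * (sarsa_target - Q[state, action])
print(round(q_update, 3), round(sarsa_update, 3))
```

Because Q-learning maximizes over next actions while SARSA follows the behavior policy, the two updates differ whenever the policy explores.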

Extended Deep Dive

Reinforcement Learning is formalized using Markov Decision Processes (MDPs), defined by states, actions, transition probabilities, and rewards. The agent’s goal is to maximize expected cumulative reward, often discounted over time.
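
The discounted cumulative reward mentioned above is G = r_0 + γ·r_1 + γ²·r_2 + …, which is a one-liner to compute for a finite reward sequence (the rewards below are illustrative):

```python
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 10.0]  # illustrative reward sequence

# Discounted return: G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(round(G, 2))  # 1 + 0.9**3 * 10 = 8.29
```

Discounting makes distant rewards count less, so the same +10 is worth only 7.29 when it arrives three steps later.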

Exploration vs Exploitation: Agents must balance trying new actions (exploration) with leveraging known rewarding actions (exploitation). Techniques like epsilon-greedy and Upper Confidence Bound (UCB) are used.
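
Both strategies mentioned above are short to write down. A sketch for a 3-armed bandit with illustrative counts and reward estimates (epsilon-greedy randomizes occasionally; UCB1 adds an exploration bonus that shrinks as an action is tried more):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = np.array([10, 5, 1])        # times each action has been tried (illustrative)
values = np.array([0.4, 0.5, 0.3])   # mean observed reward per action (illustrative)
t = counts.sum()

# Epsilon-greedy: random action with probability epsilon, else the current best
epsilon = 0.1
if rng.random() < epsilon:
    eg_action = int(rng.integers(len(values)))
else:
    eg_action = int(np.argmax(values))

# UCB1: optimism in the face of uncertainty; rarely-tried actions get a bonus
ucb_scores = values + np.sqrt(2 * np.log(t) / counts)
ucb_action = int(np.argmax(ucb_scores))
print(eg_action, ucb_action)
```

Note that UCB picks the rarely-tried arm 2 despite its lower mean estimate, because its uncertainty bonus dominates.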

Policy Gradient: Instead of learning value functions, these methods directly optimize the policy using gradient ascent. Actor-Critic models combine both approaches, improving stability and efficiency.
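
The policy-gradient idea can be sketched with REINFORCE on a two-armed bandit using a softmax policy; deterministic rewards and all numbers below are illustrative simplifications:

```python
import numpy as np

rng = np.random.default_rng(42)
theta = np.zeros(2)        # policy parameters: one preference per arm
arm_rewards = [0.0, 1.0]   # illustrative deterministic rewards; arm 1 is better
lr = 0.1

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = arm_rewards[action]
    # REINFORCE: grad of log pi(action) under a softmax is one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += lr * reward * grad_log_pi   # ascend the reward-weighted score function

print(softmax(theta))  # probability mass concentrates on the better arm
```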

Challenges: Sample inefficiency, reward sparsity, and stability in training deep RL models. Solutions include reward shaping, curriculum learning, and hybrid approaches.
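
Reward shaping is often done with potential-based shaping, F(s, s') = γΦ(s') − Φ(s), a form known to preserve the optimal policy. A sketch with an illustrative distance-to-goal potential:

```python
gamma = 0.9

def potential(state):
    # Illustrative potential: negative distance to a goal located at state 10
    return -abs(10 - state)

def shaped_reward(reward, state, next_state):
    # Potential-based shaping: F(s, s') = gamma * phi(s') - phi(s)
    return reward + gamma * potential(next_state) - potential(state)

# Moving toward the goal earns a positive shaping bonus even when the
# environment reward is zero, densifying a sparse reward signal
print(round(shaped_reward(0.0, state=3, next_state=4), 2))
```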

Summary

Reinforcement Learning is a powerful paradigm for sequential decision-making. Understanding agents, environments, rewards, policies, and value functions is essential for interviews. Candidates should be able to explain Q-learning, deep RL, and policy gradient methods, discuss real-world applications, and address challenges like exploration-exploitation and stability.

© 2026 Interview Prep Hub