Published: 2026-06-01 • Updated: 2026-06-01

Creating Code-Executing and Self-Debugging AI Agents

In the evolution of autonomous AI agents, the ability to generate code was only the first step. True autonomy is achieved when an agent can execute the code it writes, observe the runtime results, analyze errors, and debug its own code without human intervention. This closed-loop system transforms a passive text generator into an active problem-solving engine.

This guide will walk you through the architecture, implementation, and security considerations of building code-executing and self-debugging AI agents using Python.

The Self-Debugging Architecture

A self-debugging agent operates in a continuous loop. Instead of assuming its generated code is correct, it treats the code as a hypothesis, tests it in a runtime environment, and uses the feedback to refine its approach. This process closely mirrors how human software engineers write and debug software.

+-------------------------------------------------------------+
|                      User Requirement                       |
+-------------------------------------------------------------+
                               |
                               v
+-------------------------------------------------------------+
|                 LLM Generates Python Code                   |
+-------------------------------------------------------------+
                               |
                               v
+-------------------------------------------------------------+
|              Execute Code in Sandbox Environment            |
+-------------------------------------------------------------+
                               |
            +------------------+------------------+
            |                                     |
            v (Success)                           v (Runtime Error)
+------------------------+            +------------------------+
| Return Final Result to |            | Capture Traceback and  |
|         User           |            |    Execution Context   |
+------------------------+            +------------------------+
                                                  |
                                                  v
                                      +------------------------+
                                      |   LLM Analyzes Error   |
                                      |   and Generates Fix    |
                                      +------------------------+
                                                  |
                                                  +-- (Loop back to Exec)
    

Building a Safe Execution Environment

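Before allowing an LLM to run arbitrary code, we must build an execution harness. Running model-generated code directly on your host operating system is dangerous: the code can delete files, read credentials, or open network connections. Production systems should run it inside Docker containers or microVMs. For local experimentation, we can build a basic execution harness in Python that uses the sys and io modules to capture standard output and the traceback module to record exceptions. This harness only captures output and errors; it does not isolate the code from the host.

Below is a local execution engine that runs code in the current process and captures both its output and its errors. Treat it as a starting point to be moved inside a real sandbox, not as a security boundary.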

import sys
import io
import traceback

def execute_python_code(code_to_run: str, timeout_seconds: int = 5) -> dict:
    """
    Executes a string of Python code in the current process and captures
    standard output, standard error, and the traceback of any exception.

    Note: timeout_seconds is kept in the signature but is not enforced here,
    because a plain exec() call has no built-in time limit. Enforce time
    limits at the process or container level.
    """
    output_buffer = io.StringIO()
    error_buffer = io.StringIO()

    execution_result = {
        "success": False,
        "output": "",
        "error": ""
    }

    # Use one namespace for both globals and locals. With separate dicts,
    # functions defined by the code cannot see other top-level names.
    # Setting __name__ lets `if __name__ == "__main__":` blocks run.
    # exec() adds __builtins__ automatically, so this is NOT a restricted
    # environment: the code can still import modules and touch the host.
    namespace = {"__name__": "__main__"}

    # Save the original streams, then redirect them into the buffers
    original_stdout = sys.stdout
    original_stderr = sys.stderr
    sys.stdout = output_buffer
    sys.stderr = error_buffer

    try:
        # Execute the code block
        exec(code_to_run, namespace)
        execution_result["success"] = True
    except (Exception, SystemExit):
        # Capture the traceback. SystemExit is included so that a stray
        # sys.exit() in generated code does not terminate the agent itself.
        traceback.print_exc(file=error_buffer)
    finally:
        # Restore original streams
        sys.stdout = original_stdout
        sys.stderr = original_stderr

    execution_result["output"] = output_buffer.getvalue()
    execution_result["error"] = error_buffer.getvalue()

    return execution_result
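
The harness above runs code inside the agent's own process, so a runaway loop would hang the agent. One possible next step, short of a container, is to run the code in a separate interpreter and let the operating system enforce a time limit. The sketch below assumes a hypothetical function name, execute_in_subprocess, and returns the same result dictionary shape. It is not a sandbox: the child process still runs with the same user permissions as the agent.

```python
import subprocess
import sys


def execute_in_subprocess(code_to_run: str, timeout_seconds: int = 5) -> dict:
    """Run code in a separate Python interpreter with a wall-clock timeout.

    Hypothetical helper for illustration; it mirrors the result shape
    (success / output / error) used by the in-process harness.
    """
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code_to_run],
            capture_output=True,   # collect stdout and stderr
            text=True,             # decode bytes to str
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired
        return {
            "success": False,
            "output": "",
            "error": f"TimeoutError: execution exceeded {timeout_seconds} seconds",
        }

    return {
        "success": completed.returncode == 0,
        "output": completed.stdout,
        "error": completed.stderr,
    }
```

Because the child is a fresh interpreter, each run starts with a clean state, and an uncaught exception appears in the error field as the usual traceback text.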
    

Implementing the Self-Debugging Loop

Now that we have an execution engine, we can build the agent loop. The agent will attempt to solve a problem, run its code, and if it fails, send the code and the error traceback back to the LLM with instructions to fix it. We will limit the loop to a maximum number of retries to prevent infinite execution loops.

class SelfDebuggingAgent:
    def __init__(self, llm_client, max_retries: int = 3):
        self.llm = llm_client
        self.max_retries = max_retries

    def solve_problem(self, task_description: str) -> str:
        # Step 1: Generate initial code
        prompt = f"Write Python code to solve the following task: {task_description}. Return ONLY raw Python code inside a markdown block."
        generated_code = self.llm.generate(prompt)
        cleaned_code = self._extract_code(generated_code)
        
        for attempt in range(1, self.max_retries + 1):
            print(f"--- Attempt {attempt} ---")
            print(f"Running code:\n{cleaned_code}\n")
            
            # Step 2: Execute code
            result = execute_python_code(cleaned_code)
            
            if result["success"]:
                print("Execution Successful!")
                return result["output"]
            
            # Step 3: Debug if execution failed
            print(f"Execution Failed with error:\n{result['error']}")
            if attempt == self.max_retries:
                break
                
            debugging_prompt = (
                f"The task was: {task_description}\n\n"
                f"The following Python code failed execution:\n\n"
                f"```python\n{cleaned_code}\n```\n\n"
                f"It produced the following error traceback:\n\n"
                f"{result['error']}\n\n"
                f"Analyze the error, fix the bug, and return ONLY the complete updated Python code inside a markdown block."
            )
            
            generated_code = self.llm.generate(debugging_prompt)
            cleaned_code = self._extract_code(generated_code)
            
        raise RuntimeError("Agent failed to self-debug within the retry limit.")

    def _extract_code(self, raw_response: str) -> str:
        # Helper to strip markdown code block formatting if present
        if "```python" in raw_response:
            return raw_response.split("```python")[1].split("```")[0].strip()
        elif "```" in raw_response:
            return raw_response.split("```")[1].split("```")[0].strip()
        return raw_response.strip()
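
The split-based helper above can trip on small variations in model output. For example, a block tagged python3 leaves a stray "3" at the start of the extracted code. A regex-based variant is one way to tolerate that. The sketch below is an assumption-laden alternative, not a required replacement: the function name extract_code, the accepted language tags, and the "prefer the longest Python block" heuristic are all illustrative choices.

```python
import re

# Built programmatically so this example contains no literal fence marker
FENCE = "`" * 3

# A fenced block: opening fence, optional language tag, newline, body, closing fence
_BLOCK = re.compile(
    FENCE + r"[ \t]*([\w+-]*)[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL,
)

# Assumed set of tags that indicate Python code
_PYTHON_TAGS = ("python", "py", "python3")


def extract_code(raw_response: str) -> str:
    """Return the most likely Python code from a model response."""
    blocks = _BLOCK.findall(raw_response)
    if not blocks:
        # No fenced block: assume the whole response is code
        return raw_response.strip()

    # Prefer blocks explicitly tagged as Python, else fall back to any block
    python_blocks = [body for tag, body in blocks if tag.lower() in _PYTHON_TAGS]
    candidates = python_blocks or [body for _, body in blocks]

    # Heuristic: the longest candidate is usually the full script
    return max(candidates, key=len).strip()
```

If a model tends to return several partial snippets, a stricter policy such as rejecting responses with more than one block may be safer than picking the longest.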
    

Real-World Use Cases

  • Automated Data Science and Analytics: Agents can write Pandas code to clean datasets, run statistical tests, and generate charts. If a column name is misspelled or a data type is mismatched, the agent catches the exception, inspects the schema, and automatically corrects the code.
  • Dynamic Web Scraping: Web structures change constantly. A scraping agent can attempt to extract data using a specific XPath or CSS selector. If the element is not found, the agent can capture the page source, analyze the document structure, and rewrite its selector on the fly.
  • API Integration and Testing: When integrating with third-party APIs, payloads can change. A self-debugging agent can attempt API requests, capture error responses (such as HTTP 400 Bad Request), modify the payload structure based on the API's error message, and retry the request.
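
To make the data-analysis case above concrete, the agent can attach schema information to the error before asking the model for a fix. The sketch below uses the standard difflib module to suggest the closest existing column for a misspelled name. The function name build_schema_hint, the hint wording, and the 0.6 similarity cutoff are assumptions for illustration.

```python
import difflib


def build_schema_hint(missing_name: str, available_columns: list[str]) -> str:
    """Build a short hint to append to a debugging prompt.

    Lists the columns that actually exist and, when one is similar
    enough to the missing name, points at the likely intended column.
    """
    hint = (
        f"Column {missing_name!r} does not exist. "
        f"Available columns: {available_columns}."
    )
    # get_close_matches returns the best matches at or above the cutoff
    matches = difflib.get_close_matches(
        missing_name, available_columns, n=1, cutoff=0.6
    )
    if matches:
        hint += f" Closest match: {matches[0]!r}."
    return hint


if __name__ == "__main__":
    columns = ["region", "revenue", "units_sold"]
    print(build_schema_hint("revenu", columns))
```

Appending a hint like this to the traceback gives the model the schema it needs, rather than leaving it to guess the correct column name.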

Common Mistakes and How to Avoid Them

  • Executing Code in Unsafe Environments: Running model-generated code directly on production servers can lead to catastrophic data loss or security breaches. Solution: Always run generated code inside isolated sandboxes, such as Docker containers, gVisor runtimes, or WASM environments.
  • Infinite Self-Debugging Loops: If an LLM gets stuck on a logical error, it can enter an infinite loop of generating and running broken code, consuming massive amounts of API tokens. Solution: Implement a strict max_retries limit and track token usage.
  • Ignoring Environment Dependencies: The agent might write code that imports external libraries (like requests or numpy) that are not installed in the local execution environment. Solution: Provide the agent with a list of pre-installed libraries in its system prompt, or allow it to install packages dynamically inside its sandbox.
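
The dependency problem in the last item can often be caught before execution. One possible pre-flight check parses the generated code with the ast module, collects the top-level modules it imports, and asks importlib whether each one can be found. The function name find_missing_modules is hypothetical, and the check only sees static import statements, not dynamic imports.

```python
import ast
import importlib.util


def _is_available(module_name: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for unusual module states; treat as missing
        return False


def find_missing_modules(code: str) -> list[str]:
    """Return top-level modules imported by `code` that are not installed.

    Raises SyntaxError if the code cannot be parsed.
    """
    tree = ast.parse(code)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import os.path" depends on the top-level package "os"
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Skip relative imports (level > 0); they are not installable packages
            imported.add(node.module.split(".")[0])
    return sorted(name for name in imported if not _is_available(name))
```

If the returned list is non-empty, the agent can skip execution and tell the model which libraries are unavailable, which saves a retry.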

Interview Notes for AI Engineers

  • How do you secure code-executing agents? Be prepared to discuss containerization (Docker), lightweight microVMs (Firecracker), and read-only file systems. Mention disabling network access unless explicitly required by the task.
  • How do you handle state persistence across debugging steps? Explain that the execution environment should keep track of the workspace state, or the agent must regenerate the entire script from scratch during each debugging iteration to ensure reproducibility.
  • What is the role of traceback parsing? Raw tracebacks are highly informative to LLMs. Instead of just saying "your code failed", sending the exact line number and exception type (e.g., KeyError, IndexError) significantly increases the model's debugging success rate.
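
For the traceback question above, a short sketch shows what structured error capture can look like. It compiles the generated code under an assumed pseudo-filename, "<agent_code>", so that frames belonging to the generated code can be separated from frames belonging to the harness. The function name summarize_failure and the returned keys are illustrative. Like the harness earlier in this guide, it runs code in-process and is not a sandbox.

```python
import traceback
from typing import Optional

# Assumed pseudo-filename used to tag frames from the generated code
AGENT_FILENAME = "<agent_code>"


def summarize_failure(code: str) -> Optional[dict]:
    """Run `code`; return a structured error summary, or None on success."""
    try:
        exec(compile(code, AGENT_FILENAME, "exec"), {"__name__": "__main__"})
    except Exception as exc:
        # Keep only frames that belong to the generated code
        frames = [
            frame
            for frame in traceback.extract_tb(exc.__traceback__)
            if frame.filename == AGENT_FILENAME
        ]
        if frames:
            line_number = frames[-1].lineno
        else:
            # SyntaxError is raised by compile() and carries its own lineno
            line_number = getattr(exc, "lineno", None)

        source_lines = code.splitlines()
        source_line = ""
        if line_number and 1 <= line_number <= len(source_lines):
            source_line = source_lines[line_number - 1].strip()

        return {
            "exception_type": type(exc).__name__,
            "message": str(exc),
            "line_number": line_number,
            "source_line": source_line,
        }
    return None
```

A summary like this can be placed at the top of the debugging prompt, with the full traceback below it, so the model sees the exception type and failing line first.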

Summary

Creating code-executing and self-debugging agents shifts the paradigm of AI development from static code generation to dynamic runtime execution. By wrapping code generation in a safe execution harness, capturing traceback errors, and feeding those errors back to the model, we build highly resilient agents capable of solving complex, multi-step engineering tasks autonomously.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
