Published: 2026-06-01 • Updated: 2026-06-07

Key Metrics for AI Systems: Latency, Throughput, and Error Rates

In traditional software systems, monitoring application performance is relatively straightforward. We track CPU usage, memory consumption, database query times, and standard HTTP response codes. However, when we transition to AI-driven systems—especially those powered by Large Language Models (LLMs) and complex machine learning pipelines—traditional monitoring metrics fall short.

AI workloads are computationally expensive, non-deterministic, and highly variable in execution time. A single LLM request can take anywhere from a few milliseconds to several minutes depending on the prompt length, model size, and generation parameters. To maintain a reliable, high-performing AI application, you must master the three foundational pillars of AI observability: Latency, Throughput, and Error Rates.

The AI Request Lifecycle and Metric Collection Points

To understand where these metrics originate, let us look at the lifecycle of a typical streaming AI request processed through a Java enterprise backend:

[User Client] 
     │
     ▼ (1. Request Sent)
[Java Spring Boot API Gateway] ─── (Track: Requests Per Second - RPS)
     │
     ▼ (2. Prompt Sent to LLM)
[AI Model Provider / Local LLM]
     │
     ├─► (3. First Token Received) ─── (Track: Time to First Token - TTFT)
     │
     ├─► (4. Stream of Tokens)     ─── (Track: Tokens Per Second - TPS)
     │
     ▼ (5. Complete Response)
[Java Spring Boot API Gateway] ─── (Track: Total Latency / Error Rates)
     │
     ▼ (6. Final Render)
[User Client]

1. Latency: Measuring the Speed of AI Responses

Latency is the time taken to process a request and return a response. In traditional web applications, latency is measured from the moment a request is received to the moment the response is fully sent. In generative AI systems, this single metric is insufficient because users expect real-time streaming interfaces.

To accurately monitor AI latency, we must break it down into three distinct metrics:

  • Time to First Token (TTFT): The duration between sending the prompt and receiving the very first token of the response. This is the most critical metric for user experience, as it determines how "responsive" the AI feels.
  • Time per Output Token (TPOT): The average time required to generate each subsequent token. A low TPOT ensures a smooth, readable stream of text for the end-user.
  • Total Latency (Turnaround Time): The overall time elapsed from the initial request to the final token generation. This is critical for non-streaming batch jobs, such as automated document summarization.
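
The three metrics above are related by simple arithmetic, which the following minimal sketch makes explicit. The class name LatencyBreakdown and its parameters are hypothetical; it assumes you already capture three timestamps (request sent, first token, last token) and the output token count for each response.

```java
import java.time.Duration;

/** Derives the three latency metrics from raw timestamps of one streamed response. */
final class LatencyBreakdown {
    final Duration ttft;         // time to first token
    final Duration tpot;         // average time per output token after the first
    final Duration totalLatency; // request sent -> last token

    LatencyBreakdown(long requestSentNanos, long firstTokenNanos,
                     long lastTokenNanos, int outputTokens) {
        if (outputTokens < 1) {
            throw new IllegalArgumentException("at least one output token is required");
        }
        this.ttft = Duration.ofNanos(firstTokenNanos - requestSentNanos);
        this.totalLatency = Duration.ofNanos(lastTokenNanos - requestSentNanos);
        // The first token is already covered by TTFT, so average over the remaining ones.
        long remaining = outputTokens - 1;
        this.tpot = remaining == 0
            ? Duration.ZERO
            : Duration.ofNanos((lastTokenNanos - firstTokenNanos) / remaining);
    }
}
```

For example, a response whose first token arrives after 400 ms and whose 101st (last) token arrives after 2.4 s has a TPOT of 20 ms.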

Implementing Latency Tracking in Java

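In a reactive Java application instrumented with Micrometer (for example, one that streams model output through Spring AI), tracking these metrics requires intercepting the token stream. The example below measures TTFT and Total Latency with Micrometer timers. Total latency is recorded only when the stream completes normally; failed or cancelled streams need separate handling: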

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

@Service
public class AiMonitoringService {

    // Timers are registered once and reused for every request.
    private final Timer ttftTimer;
    private final Timer totalLatencyTimer;

    public AiMonitoringService(MeterRegistry meterRegistry) {
        this.ttftTimer = Timer.builder("ai.latency.ttft")
            .description("Time to First Token")
            .register(meterRegistry);
        this.totalLatencyTimer = Timer.builder("ai.latency.total")
            .description("Total Turnaround Latency")
            .register(meterRegistry);
    }

    public Flux<String> streamAiResponseWithMetrics(Flux<String> tokenStream) {
        // defer() starts the clock at subscription time, not when the pipeline is assembled.
        return Flux.defer(() -> {
            long startNanos = System.nanoTime();
            AtomicBoolean firstTokenReceived = new AtomicBoolean(false);

            return tokenStream
                .doOnNext(token -> {
                    // Record TTFT only once, on the first emitted token.
                    if (firstTokenReceived.compareAndSet(false, true)) {
                        ttftTimer.record(System.nanoTime() - startNanos, TimeUnit.NANOSECONDS);
                    }
                })
                // Total latency is recorded only for streams that complete successfully.
                .doOnComplete(() ->
                    totalLatencyTimer.record(System.nanoTime() - startNanos, TimeUnit.NANOSECONDS));
        });
    }
}

2. Throughput: Measuring System Processing Capacity

Throughput defines the volume of work your AI system can handle in a given timeframe. In standard web services, throughput is measured in Requests Per Second (RPS). While RPS is still useful for scaling web servers, it is a poor metric for capacity planning in LLMs.

Because LLM processing costs and execution times scale with the number of input and output tokens, AI throughput must be measured using token-centric metrics:

  • Tokens Per Second (TPS): The total number of tokens (both input prompts and output generations) processed by your system per second. This is the most direct indicator of model server utilization.
  • Input vs. Output Token Ratio: Tracking the volume of input tokens versus generated output tokens. Input tokens are cheaper and faster to process per token (pre-fill phase) than output tokens (generation phase), so the ratio shapes both cost and latency.
  • Requests Per Second (RPS): Useful for monitoring gateway load, connection pools, and rate-limiting thresholds.

If your system processes large PDF documents, a low RPS might look like a healthy, idle system. However, if those few requests contain millions of tokens, your underlying LLM infrastructure could be near exhaustion. Always prioritize TPS over RPS for AI scaling policies.
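
To illustrate how RPS and TPS can diverge, here is a minimal sketch of a counter that tracks both over one measurement window. ThroughputWindow and its methods are hypothetical names, and the token counts are assumed to come from your provider's usage data or your own tokenizer.

```java
import java.util.concurrent.atomic.LongAdder;

/** Accumulates request and token counts so RPS and TPS can be compared over one window. */
final class ThroughputWindow {
    private final LongAdder requests = new LongAdder();
    private final LongAdder inputTokens = new LongAdder();
    private final LongAdder outputTokens = new LongAdder();

    /** Call once per completed request with its input and output token counts. */
    void record(long inTokens, long outTokens) {
        requests.increment();
        inputTokens.add(inTokens);
        outputTokens.add(outTokens);
    }

    double requestsPerSecond(double windowSeconds) {
        return requests.sum() / windowSeconds;
    }

    /** Total tokens (input + output) per second, matching the TPS definition above. */
    double tokensPerSecond(double windowSeconds) {
        return (inputTokens.sum() + outputTokens.sum()) / windowSeconds;
    }

    /** Input tokens per output token; NaN if nothing has been generated yet. */
    double inputOutputRatio() {
        long out = outputTokens.sum();
        return out == 0 ? Double.NaN : (double) inputTokens.sum() / out;
    }
}
```

With three requests in a 60-second window, each carrying 200,000 input tokens and 1,000 output tokens, this reports 0.05 RPS but 10,050 TPS, which is the situation the paragraph above warns about.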

3. Error Rates: Understanding System and Semantic Failures

Error tracking in AI systems is uniquely complex. We must categorize errors into two distinct buckets: System Errors and Semantic Errors.

System Errors (Infrastructure Failures)

These are standard software engineering failures that prevent a request from completing. They are easily tracked using traditional HTTP status codes and exception logging:

  • HTTP 429 (Too Many Requests): Indicates that your application has exceeded the rate limits of your LLM provider.
  • HTTP 503 / 504 (Service Unavailable / Gateway Timeout): Occurs when the model provider is overloaded or the model inference takes longer than your configured HTTP timeout.
  • Out Of Memory (OOM) Errors: Common when self-hosting open-source models (like Llama or Mistral) on GPU clusters, typically when long contexts or large batches push memory use past available VRAM.
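
For the HTTP 429 case in the list above, clients commonly wait before retrying. The sketch below computes such a wait using exponential backoff with "full jitter", one common variant and not the only option. RetryDelayPolicy is a hypothetical helper; it assumes the caller has already parsed a Retry-After header given in seconds (the HTTP-date form of that header is not handled here).

```java
import java.time.Duration;
import java.util.OptionalLong;
import java.util.concurrent.ThreadLocalRandom;

/** Computes how long to wait before retrying a request that failed with HTTP 429. */
final class RetryDelayPolicy {
    private final Duration baseDelay;
    private final Duration maxDelay;

    RetryDelayPolicy(Duration baseDelay, Duration maxDelay) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
    }

    /**
     * @param attempt           zero-based retry attempt
     * @param retryAfterSeconds parsed Retry-After value, if the provider sent one
     */
    Duration nextDelay(int attempt, OptionalLong retryAfterSeconds) {
        // Prefer the server's explicit hint when it is available.
        if (retryAfterSeconds.isPresent()) {
            return Duration.ofSeconds(retryAfterSeconds.getAsLong());
        }
        // Exponential ceiling: base * 2^attempt, capped at maxDelay.
        // The shift is bounded so the multiplication cannot overflow for sane base delays.
        long ceilingMs = Math.min(maxDelay.toMillis(),
                baseDelay.toMillis() * (1L << Math.min(attempt, 20)));
        // Full jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(ceilingMs + 1));
    }
}
```

A caller would typically also cap the number of attempts and give up, or fall back to another model, once that budget is spent.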

Semantic Errors (Behavioral Failures)

These are cases where the system returns an HTTP 200 OK status, but the output is fundamentally broken, unsafe, or useless. Monitoring semantic errors requires specialized guardrails:

  • Guardrail Violations: When input prompts or output responses trigger safety filters (e.g., hate speech, prompt injection attempts).
  • Empty or Truncated Responses: Occurs when a model hits maximum token limits prematurely or encounters generation issues.
  • Format Violations: When your application expects structured data (like JSON) but the model outputs plain text, causing parsing failures in downstream Java code.
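
One way to make these failures visible is to tag every HTTP 200 response with a semantic outcome before recording metrics. The following is a rough sketch under stated assumptions: SemanticOutcomeClassifier is a hypothetical class, the finish-reason strings "length" and "content_filter" are assumed values that vary by provider, and the JSON check is a cheap shape test, not a real parser.

```java
/** Tags an HTTP 200 response with a semantic outcome so dashboards can count behavioral failures. */
final class SemanticOutcomeClassifier {

    enum Outcome { OK, GUARDRAIL_BLOCKED, EMPTY, TRUNCATED, FORMAT_VIOLATION }

    /**
     * @param text         the model output
     * @param finishReason provider-reported stop reason (assumed values, check your provider's API)
     * @param expectJson   whether the caller asked for a JSON object
     */
    static Outcome classify(String text, String finishReason, boolean expectJson) {
        if ("content_filter".equals(finishReason)) {
            return Outcome.GUARDRAIL_BLOCKED;
        }
        if (text == null || text.isBlank()) {
            return Outcome.EMPTY;
        }
        if ("length".equals(finishReason)) {
            return Outcome.TRUNCATED;
        }
        if (expectJson && !looksLikeJsonObject(text)) {
            return Outcome.FORMAT_VIOLATION;
        }
        return Outcome.OK;
    }

    // Cheap shape check only; a real service should run the text through its JSON parser.
    private static boolean looksLikeJsonObject(String text) {
        String trimmed = text.strip();
        return trimmed.startsWith("{") && trimmed.endsWith("}");
    }
}
```

The resulting outcome can then be attached as a tag on your request counter, so that semantic failures appear alongside HTTP-level errors.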

Real-World Use Cases

Use Case 1: E-Commerce AI Chatbot

An e-commerce company uses a streaming LLM to answer customer support queries. For this use case, Time to First Token (TTFT) is the most critical metric. If TTFT exceeds 2 seconds, customers perceive the chatbot as laggy and abandon the chat. By monitoring TTFT, the engineering team can set up auto-scaling rules to spin up more model instances when TTFT (preferably tracked as a percentile such as p95 rather than a mean) exceeds 1.5 seconds.

Use Case 2: Financial Document Batch Processor

A financial institution processes thousands of PDF reports overnight to extract risk metrics. For this offline batch system, TTFT is irrelevant. The critical metrics are Tokens Per Second (TPS) and Error Rates (specifically HTTP 429 rate limits). By maximizing TPS while carefully monitoring 429 errors, the Java batch processing engine can dynamically adjust its concurrency limits to complete the processing window before the markets open.
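
The article does not say how the batch engine adjusts its concurrency. One plausible approach, shown here purely as an illustration, is additive-increase/multiplicative-decrease (AIMD): raise the limit by one after each success and halve it on each HTTP 429. AdaptiveConcurrencyLimit and its bounds are hypothetical.

```java
/** Adjusts a batch job's concurrency limit: grow slowly on success, cut sharply on HTTP 429. */
final class AdaptiveConcurrencyLimit {
    private final int min;
    private final int max;
    private int limit;

    AdaptiveConcurrencyLimit(int min, int max, int initial) {
        if (min < 1 || max < min) {
            throw new IllegalArgumentException("require 1 <= min <= max");
        }
        this.min = min;
        this.max = max;
        this.limit = Math.max(min, Math.min(max, initial));
    }

    /** Call once per completed request with the HTTP status returned by the provider. */
    synchronized void onResponse(int httpStatus) {
        if (httpStatus == 429) {
            limit = Math.max(min, limit / 2);   // multiplicative decrease
        } else if (httpStatus >= 200 && httpStatus < 300) {
            limit = Math.min(max, limit + 1);   // additive increase
        }
        // Other statuses leave the limit unchanged.
    }

    synchronized int currentLimit() {
        return limit;
    }
}
```

The worker pool would consult currentLimit() before dispatching the next document, so a burst of 429 responses quickly reduces pressure on the provider while sustained success gradually restores throughput.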

Common Mistakes to Avoid

  • Mistake 1: Relying on Average Latency: AI latency has a long-tail distribution, and averages hide extreme outliers. Always monitor percentiles: p90, p95, and p99 latency. If your p99 latency is 15 seconds, 1% of requests take 15 seconds or longer, even if your average latency is a comfortable 1.5 seconds.
  • Mistake 2: Ignoring Token Counts in Rate Limiting: Rate limits based strictly on per-IP request counts can lead to system crashes. A single user sending a massive prompt can consume more GPU memory than a hundred users sending short greetings. Rate-limit by token consumption, not just request count.
  • Mistake 3: Treating Guardrail Blocks as Successful Responses: When an AI safety guardrail blocks a response, it often returns a standard HTTP 200 with a generic "I cannot answer that" message. If you do not parse and tag these responses specifically as semantic errors, your monitoring dashboards will falsely show a 100% success rate while your users are getting blocked continuously.
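
To make Mistake 1 concrete, the sketch below computes a mean and nearest-rank percentiles over a batch of latency samples. LatencyPercentiles is a hypothetical helper for illustration; in production, a metrics library would normally compute percentiles for you from histograms.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over a batch of latency samples. */
final class LatencyPercentiles {

    /** Returns the smallest sample such that at least {@code percentile}% of samples are <= it. */
    static long percentile(long[] samplesMs, double percentile) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // 1-based nearest rank; multiply before dividing to limit floating-point error.
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }
}
```

For 100 samples where 98 requests take 1,000 ms and 2 take 15,000 ms, the mean is 1,280 ms and p95 is 1,000 ms, while p99 is 15,000 ms. Only the p99 reveals the slow tail.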

Interview Notes (For Senior Developers and Architects)

  • Question: How does the pre-fill phase of an LLM affect latency compared to the generation phase?
  • Answer: The pre-fill phase processes the entire input prompt in parallel on the GPU, so its cost per token is low, but it must finish before the first output token can be emitted. The generation phase produces output tokens sequentially (one by one), is memory-bandwidth bound, and is significantly slower per token. Therefore, prompt length mainly affects TTFT, while output length drives total turnaround latency.
  • Question: How would you design a resilient Java client to handle HTTP 429 (Rate Limit) errors from an external LLM API?
  • Answer: I would implement an exponential backoff retry strategy with jitter, specifically looking for the Retry-After header in the HTTP response. In a reactive Spring Boot application, this can be achieved using Project Reactor's Retry.backoff() utility, combined with circuit breakers (like Resilience4j) to temporarily trip and route traffic to a fallback model if rate limits are consistently violated.
  • Question: Why is monitoring GPU memory usage critical for local LLM inference hosting?
  • Answer: Unlike traditional Java applications, whose memory can be paged to disk by the operating system, GPU inference cannot spill beyond physical VRAM without massive performance penalties. If the combined size of the model weights and the active request contexts (the KV cache) exceeds physical VRAM, the inference process typically fails with an Out-of-Memory (OOM) error. Monitoring active context sizes and batch sizes is critical to prevent VRAM overflow.
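
The VRAM answer above can be turned into a back-of-the-envelope check. The sketch below uses the standard per-token key/value cache estimate for transformer models (keys and values, for every layer and KV head). VramEstimator is a hypothetical helper, and the estimate deliberately ignores activations, fragmentation, and framework overhead, so treat the result as an optimistic upper bound on capacity.

```java
/** Rough VRAM estimate for self-hosted inference: weights plus KV cache for active requests. */
final class VramEstimator {

    /** KV cache bytes per token: keys and values (the factor 2) for every layer and KV head. */
    static long kvCacheBytesPerToken(int layers, int kvHeads, int headDim, int bytesPerValue) {
        return 2L * layers * kvHeads * headDim * bytesPerValue;
    }

    /** True when weights plus the KV cache for all active tokens fit in the VRAM budget. */
    static boolean fits(long weightBytes, long kvBytesPerToken,
                        long activeTokens, long vramBytes) {
        return weightBytes + kvBytesPerToken * activeTokens <= vramBytes;
    }
}
```

With illustrative values of 32 layers, 8 KV heads, a head dimension of 128, and 2 bytes per value (16-bit precision), each token in flight costs 128 KiB of cache. If weights occupy 16 GiB of a 24 GiB GPU, the remaining 8 GiB holds at most 65,536 active tokens across all concurrent requests.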

Summary

Monitoring AI systems requires a paradigm shift from traditional infrastructure monitoring. By focusing on the core metrics of Latency (broken down into TTFT, TPOT, and total latency), Throughput (measured via Tokens Per Second), and Error Rates (categorized into system and semantic failures), you can ensure your AI applications remain fast, reliable, and cost-efficient. As you build out your observability stack, remember to monitor percentiles rather than averages, and always design your Java applications to gracefully handle the unique failure modes of generative AI.

About the Author

Naresh Kumar

Senior Java Backend Engineer experienced in Banking, Payments, ISO 20022, Spring Boot, Microservices, Kafka, Docker, Kubernetes, AWS and Cloud Native Systems.

Built enterprise payment solutions, transaction processing systems, API platforms and scalable microservices used in production.
