Terminator: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

1UT Austin    2EPFL    3ENS Paris-Saclay    4Télécom Paris   
TL;DR: Terminator cuts LLM reasoning length by 14%–55%, halves wall-clock latency, and achieves best or second-best performance on 28 of 32 metrics across four models and four benchmarks, all without fine-tuning the base model.
Figure 1: Early Stopping via Terminator. Side-by-side comparison of Terminator-Qwen3-8B (left) vs. vanilla Qwen3-8B (right) on the following MATH-500 question: "Define $p = \sum_{k=1}^\infty \frac{1}{k^2}$ and $q = \sum_{k=1}^\infty \frac{1}{k^3}.$ Find a way to write $\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}$ in terms of $p$ and $q.$"

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute even after the final answer has been generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial, as they are strongly task- and model-dependent. We address precisely this problem and design Terminator, an early-exit strategy that mitigates overthinking in LRMs at inference time. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first-answer positions to create a novel dataset of optimal reasoning lengths on which Terminator is trained. Powered by this approach, Terminator achieves significant reductions in CoT length of 14%–55% on average across four challenging practical datasets (MATH-500, AIME 2025, HumanEval, and GPQA) whilst outperforming current state-of-the-art methods.

Motivation

Large Reasoning Models (LRMs) frequently overthink: even after generating the correct final answer early in their Chain-of-Thought, they continue to reason for hundreds or thousands of additional tokens, double-checking, exploring alternatives, and sometimes even changing to a wrong answer.

A natural question follows: can we detect when the LRM has already generated its final answer? We find that the answer is yes. By aligning many CoTs to the position where the final answer first appears and averaging, a clear confidence spike emerges at exactly that position. Individual CoT signals are noisy (Figure 2, left), but event-locked averaging across 3,200 CoTs from math, science, and coding problems reveals a consistent and sharp signal (Figure 2, right).

Event-locked averaging of Token-Confidence and log-probability signals
Figure 2: Event-Locked Averaging of Token-Confidence. Left: single-sample trajectories are noisy. Right: averaging many CoTs aligned by the first answer position reveals a clear confidence spike, consistent across data sources.

While this averaged signal clearly marks the answer position, using it directly during online inference is not straightforward: it requires multiple CoTs and knowledge of the answer position, which is the very thing we are trying to predict. This motivates a learned approach: training a probe on the LRM's hidden states to predict, token by token, whether the final answer has already been generated.
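As an illustration of the event-locked averaging described above, the following is a minimal NumPy sketch (the function name, window size, and NaN-padding details are our own choices, not the paper's code): each per-token confidence trace is shifted so that its first-answer position lands at a common index, and traces are then averaged position-wise.

```python
import numpy as np

def event_locked_average(traces, event_positions, window=50):
    """Average per-token confidence traces aligned at an event index.

    traces: list of 1-D arrays of per-token confidences (variable length).
    event_positions: index of the first final-answer token in each trace.
    Returns the mean trace over [-window, +window) around the event,
    ignoring positions that fall outside a given trace.
    """
    aligned = np.full((len(traces), 2 * window), np.nan)
    for i, (trace, pos) in enumerate(zip(traces, event_positions)):
        lo, hi = max(0, pos - window), min(len(trace), pos + window)
        # Shift the trace so the event lands at column `window`.
        aligned[i, (lo - pos) + window : (hi - pos) + window] = trace[lo:hi]
    # nanmean ignores positions a short trace never covered.
    return np.nanmean(aligned, axis=0)
```

On synthetic traces with a small bump at the event position, the per-sample signal is buried in noise, but the averaged trace shows a clear spike at the aligned index, mirroring Figure 2.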


Key Results

Table 1 shows that Terminator achieves competitive performance compared to existing methods. We provide results for two additional models, Ministral-3-8B-Reasoning-2512 and Ministral-3-14B-Reasoning-2512, in the table in our paper. Figure 3 shows the Pareto frontier on each dataset using the same data provided in Table 1. Notably, Terminator lies on or near the Pareto frontier for all datasets and all models.

Table 1: Performance of Terminator and Baselines. ↑ = higher is better, ↓ = lower is better. CR = compression rate (mean per-sample). Tok = mean tokens per sample. Bold and underlined values highlight the best and second-best early exit methods. Terminator demonstrates superior accuracy-efficiency trade-offs (best or second-best performance across 28 out of 32 metrics).
Each cell reports Acc↑ / Tok↓ / CR↓; the Overall column reports Acc↑ / CR↓.

Qwen3-8B

| Method | MATH-500 (Math) | AIME 2025 (Math) | HumanEval (Coding) | GPQA (Science) | Overall |
|---|---|---|---|---|---|
| Vanilla | 91.1% / 5,037 / 100% | 74.4% / 14,499 / 100% | 86.4% / 3,792 / 100% | 57.6% / 8,594 / 100% | 77.4% / 100% |
| NoThinking | 80.7% / 809 / 16.1% | 22.0% / 2,355 / 18.6% | 78.5% / 353 / 11.8% | 30.7% / 1,204 / 15.8% | 53.0% / 15.6% |
| DEER | 79.9% / 2,602 / 52.0% | 21.4% / 10,349 / 67.8% | 77.4% / 3,275 / 83.6% | 50.5% / 8,553 / 99.6% | 57.3% / 75.8% |
| Thought-Calib | 90.1% / 4,372 / 93.9% | 65.8% / 11,014 / 81.5% | 71.8% / 3,267 / 92.9% | 52.6% / 6,240 / 78.9% | 70.1% / 86.8% |
| Dynasor | 78.3% / 1,850 / 41.0% | 48.0% / 7,479 / 48.8% | 79.3% / 2,883 / 78.4% | 43.2% / 2,455 / 28.4% | 62.2% / 49.2% |
| Terminator | 90.7% / 2,425 / 45.1% | 69.4% / 10,970 / 70.7% | 82.9% / 2,716 / 69.9% | 52.1% / 7,543 / 85.7% | 72.6% / 67.8% |

Qwen3-14B

| Method | MATH-500 (Math) | AIME 2025 (Math) | HumanEval (Coding) | GPQA (Science) | Overall |
|---|---|---|---|---|---|
| Vanilla | 92.0% / 4,598 / 100% | 79.9% / 14,255 / 100% | 76.0% / 3,296 / 100% | 59.9% / 7,628 / 100% | 77.0% / 100% |
| NoThinking | 84.1% / 786 / 17.5% | 26.3% / 2,472 / 19.9% | 78.3% / 317 / 12.2% | 32.1% / 1,265 / 18.8% | 55.2% / 17.1% |
| DEER | 80.9% / 2,501 / 56.2% | 27.6% / 10,497 / 71.0% | 80.7% / 2,961 / 87.3% | 49.0% / 7,451 / 97.4% | 59.6% / 78.0% |
| Thought-Calib | 89.8% / 3,778 / 92.0% | 63.3% / 9,429 / 71.3% | 74.8% / 2,582 / 87.1% | 54.6% / 5,757 / 81.9% | 70.6% / 83.1% |
| Dynasor | 79.6% / 1,702 / 42.4% | 61.8% / 7,937 / 52.8% | 83.9% / 2,611 / 82.2% | 45.6% / 2,101 / 29.1% | 67.7% / 51.6% |
| Terminator | 90.7% / 2,261 / 46.8% | 74.2% / 10,787 / 71.0% | 83.3% / 2,358 / 70.9% | 53.9% / 6,798 / 87.1% | 78.0% / 65.0% |
Pareto frontiers of accuracy versus compression rate across reasoning models and benchmarks
Figure 3: Pareto Frontiers. Accuracy versus compression rate across four reasoning models (Qwen3-8B, Qwen3-14B, Ministral-8B, Ministral-14B) and four benchmarks (MATH-500, AIME25, HumanEval, GPQA). Each point represents a method's accuracy and compression rate, with lower compression rates indicating greater token savings. The dashed line traces the Pareto frontier connecting non-dominated solutions. Terminator consistently achieves strong Pareto efficiency, offering the best accuracy-efficiency tradeoff across models and tasks.

Latency Analysis

We benchmark Terminator's latency and throughput using a vLLM-compatible implementation on MATH-500 (batch size 1, single GH200). Terminator more than halves average latency with only a small throughput overhead: 10.8% for Qwen3-8B and 7.5% for Qwen3-14B. Because Terminator's architecture (a single transformer layer and an FFN) stays fixed, its relative overhead shrinks as the base LRM grows.

Table 2: Latency Analysis. Latency and throughput benchmarks on MATH-500 (batch size 1) for Qwen3-based vanilla and Terminator models. Values are mean ± 95% CI.
| Method | Latency (s) | Throughput (tok/s) |
|---|---|---|
| Qwen3-8B Vanilla | 32.68 ± 9.59 | 151.5 ± 4.4 |
| Qwen3-8B Terminator | 14.10 ± 6.27 | 135.2 ± 2.0 |
| Qwen3-14B Vanilla | 43.38 ± 13.98 | 98.0 ± 2.0 |
| Qwen3-14B Terminator | 18.76 ± 6.52 | 90.6 ± 0.8 |

Method

Terminator is a single-layer transformer that sits on top of a base LRM and injects the end-of-thought token once the LRM has generated its final answer. Terminator takes the LRM's final hidden states as input and predicts 0 if the answer has not yet been generated, or 1 if it has.
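To make the shape of the probe concrete, here is a toy stand-in written in NumPy. It keeps only the FFN classifier head mapping each hidden state to a per-token exit probability; the actual Terminator also includes a single transformer layer, and all names, dimensions, and (random, untrained) weights below are illustrative assumptions rather than the released model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ProbeHead:
    """Toy stand-in for Terminator's classifier head: a two-layer FFN
    mapping each token's hidden state to P(answer already generated).
    The real probe additionally contains one transformer layer, omitted
    here for brevity; weights here are random and untrained."""

    def __init__(self, hidden_dim, ffn_dim=64, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, hidden_dim ** -0.5, (hidden_dim, ffn_dim))
        self.w2 = rng.normal(0, ffn_dim ** -0.5, (ffn_dim, 1))

    def __call__(self, hidden_states):
        """hidden_states: (seq_len, hidden_dim) -> (seq_len,) probabilities."""
        h = np.maximum(hidden_states @ self.w1, 0.0)  # ReLU FFN
        return sigmoid(h @ self.w2)[:, 0]
```

Because the probe reads hidden states the LRM already computes, it adds one small forward pass per token rather than a second full model.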

Method diagram
Figure 4: Terminator Architecture. Terminator predicts when the LRM has generated its final answer.
Training Details

Terminator is trained to predict the exact position of the first occurrence of the final answer of an LRM's CoT during online inference. An important challenge we overcome is to identify the exact position of the earliest final answer occurrence for a given CoT. Our training dataset is curated using the following pipeline for robust extraction of the answer position:

  1. Collect CoTs from the LRM
  2. For each CoT, we use a separate LRM to analyze the CoT for the final answer and extract its exact position
  3. Given the token position index of the answer, we construct the labels: set all positions before the answer to 0 and all positions after to 1

Finally, Terminator is trained with binary cross-entropy loss over each position in the CoT. By using the above training strategy, Terminator learns to exit when the LRM has generated the final answer it will eventually commit to.
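The label-construction and loss steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the first-answer position itself is labeled 1; the function names are hypothetical, not the paper's code.

```python
import math

def make_exit_labels(num_tokens, answer_position):
    """Per-token binary labels for one CoT (pipeline step 3).

    Tokens before the first-answer position get label 0 ("keep thinking");
    tokens at or after it get label 1 ("answer already generated")."""
    if not 0 <= answer_position < num_tokens:
        raise ValueError("answer position must lie inside the CoT")
    return [0] * answer_position + [1] * (num_tokens - answer_position)

def bce_loss(probs, labels):
    """Mean binary cross-entropy over every position in the CoT."""
    eps = 1e-9
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)
```

For a 5-token CoT whose answer first appears at position 3, the labels are `[0, 0, 0, 1, 1]`, and the loss rewards predicted probabilities that stay low before that position and high after it.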

Inference Details

At inference time, Terminator makes a prediction for each token generated by the LRM. It forces the end-of-thought token, terminating reasoning early, according to the following rules:

  • Predict 1 if the predicted confidence is above a fixed threshold (set to 0.7 by default)
  • Count the number of predicted 1s in a sliding window of the most recent predictions (the window size is 10 by default)
  • If the majority (>5 if the window size is 10) of predictions are 1, then the CoT is terminated
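The three rules above can be sketched as a small stateful gate; a minimal Python illustration with the stated defaults (0.7 threshold, window of 10, strict majority > 5), where the class and method names are ours, not the released API.

```python
from collections import deque

class TerminatorGate:
    """Sliding-window majority vote over per-token exit predictions."""

    def __init__(self, threshold=0.7, window_size=10):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)  # most recent binarized predictions
        self.majority = window_size // 2

    def should_exit(self, confidence):
        # Rule 1: binarize the probe's confidence at the threshold.
        self.window.append(1 if confidence > self.threshold else 0)
        # Rules 2-3: terminate once a strict majority of the recent
        # predictions in the window are 1.
        return sum(self.window) > self.majority
```

Feeding ten low-confidence tokens never fires the gate; once six of the last ten binarized predictions are 1, `should_exit` returns `True` and the end-of-thought token would be injected.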

Additional Results

Below are some additional results that we wish to highlight.

Thinking Token Frequency Shift

"Thinking tokens" such as wait, hmm, okay, and alternatively are associated with ongoing reasoning. We find that they exhibit measurable frequency shifts once the final answer has been generated. The plots below show token rates before vs. after the answer for three such tokens: hmm and okay occur more frequently before the answer, while a third occurs more frequently after. Dot size reflects relative CoT length. Similar plots for other "thinking tokens" appear in the appendix of our paper.

Token usage frequency shift before and after final answer
Figure 5: Token Usage Frequency Shift. "Thinking token" rates change depending on whether the final answer has been generated. Dot size reflects relative CoT length.
Example 1: MATH-500

For easier problems like those in MATH-500, Terminator shows sharp transitions in predicted confidence around the exit threshold, with good separation between the "still reasoning" and "answer generated" regimes. By manual inspection, we observe a clearer transition to overthinking on these problems, which Terminator detects well.

Predicted probabilities for four MATH-500 samples
Figure 6: Predicted Probabilities for MATH-500. Terminator's predicted probability stream for early-exiting on four randomly chosen samples from MATH-500.

Figure 7 zooms in on a single MATH-500 sample, showing the full CoT with Terminator's predicted probabilities overlaid. The predicted probability remains low throughout the reasoning phase and rises sharply once the final answer has been generated, triggering the early exit.

Predicted probabilities for a single MATH-500 sample
Figure 7: Predicted Probabilities for MATH-500. Terminator's predicted probabilities for early-exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.
Example 2: AIME 2025

For harder problems like those in AIME 2025, the transition from productive reasoning to overthinking is less obvious, and Terminator's predicted probabilities do not show the same sharp transition seen on MATH-500. This is consistent with the observation that identifying a good exit position is more challenging on very hard tasks.

Predicted probabilities for four AIME 2025 samples
Figure 8: Predicted Probabilities for AIME 2025. Terminator's predicted probability stream for early-exiting on four randomly chosen samples from AIME 2025.

Figure 9 shows a case where Terminator struggles to find a clean exit position. The predicted probabilities oscillate near the threshold rather than making a decisive jump, reflecting the difficulty of the underlying problem. Despite this, Terminator still achieves strong performance on AIME 2025 overall.

Predicted probabilities for a single AIME 2025 sample
Figure 9: Predicted Probabilities for AIME 2025. Terminator's predicted probabilities for early-exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.

Demo 🚀

Try Terminator yourself! We provide ready-to-run model packages on HuggingFace with a high-performance vLLM server and a standalone inference script.

Available Models

Model Base Model Min. VRAM
Terminator-Qwen3-8B Qwen3-8B ~24 GB
Terminator-Qwen3-14B Qwen3-14B ~40 GB

Quick Start

# Clone a model repo (requires Git LFS: https://git-lfs.com)
git lfs install
git clone https://huggingface.co/acnagle/Terminator-Qwen3-8B
cd Terminator-Qwen3-8B

# Automated setup (creates conda env, installs vLLM, downloads base model)
./setup.sh

# Start the vLLM server
./start_server.sh

# In another terminal, chat with the model
python client.py --interactive
Standalone Inference (No Server)

For quick testing without spinning up a vLLM server, use the HuggingFace-native inference script included in each model repo:

python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"

Thinking content is streamed in dimmed text; the final answer is shown in bold. For best performance, the vLLM server is recommended.

Using the API Directly

The vLLM server exposes an OpenAI-compatible API, so you can use any OpenAI client library:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Terminator-Qwen3-8B",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

# Thinking content (chain-of-thought)
print(response.choices[0].message.reasoning)

# Final answer
print(response.choices[0].message.content)

Citation

@misc{nagle2026terminatorlearningoptimalexit,
  title   = {TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning},
  author  = {Alliot Nagle and Jakhongir Saydaliev and Dhia Garbaya and Michael Gastpar and Ashok Vardhan Makkuva and Hyeji Kim},
  year    = {2026},
  eprint  = {2603.12529},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url     = {https://arxiv.org/abs/2603.12529}
}