Terminator: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

1UT Austin    2EPFL    3ENS Paris-Saclay    4Télécom Paris   
TL;DR: Terminator cuts LLM reasoning length by 14%–55%, halves wall-clock latency, and achieves best or second-best performance on 28 of 32 metrics across four models and four benchmarks, all without fine-tuning the base model.
Figure 1: Early Stopping via Terminator. Side-by-side comparison of Terminator-Qwen3-8B (left) vs. vanilla Qwen3-8B (right) on the following MATH-500 question: "Define $p = \sum_{k=1}^\infty \frac{1}{k^2}$ and $q = \sum_{k=1}^\infty \frac{1}{k^3}.$ Find a way to write $\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}$ in terms of $p$ and $q.$"

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which lets them generate intermediate thinking tokens before arriving at the final answer. However, LRMs often overthink, spending substantial additional compute even after the answer has already been generated early in the CoT. Prior work has identified the existence of an optimal reasoning length: truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial, as they are both task- and model-dependent. We address precisely this problem and design Terminator, an early-exit strategy that mitigates overthinking in LRMs at inference time. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable; we leverage these first-answer positions to create a novel dataset of optimal reasoning lengths on which Terminator is trained. With this approach, Terminator reduces CoT lengths by 14%–55% on average across four challenging practical datasets (MATH-500, AIME 2025, HumanEval, and GPQA) while outperforming current state-of-the-art methods.

Motivation

Large Reasoning Models (LRMs) frequently overthink: even after generating the correct final answer early in their Chain-of-Thought, they continue to reason for hundreds or thousands of additional tokens, double-checking, exploring alternatives, and sometimes even changing to a wrong answer.

A natural question follows: can we detect when the LRM has already generated its final answer? We find that the answer is yes. By aligning many CoTs to the position where the final answer first appears and averaging, a clear confidence spike emerges at exactly that position. Individual CoT signals are noisy (Figure 2, left), but event-locked averaging across 3,200 CoTs from math, science, and coding problems reveals a consistent and sharp signal (Figure 2, right).

Event-locked averaging of Token-Confidence and log-probability signals
Figure 2: Event-Locked Averaging of Token-Confidence. Left: single-sample trajectories are noisy. Right: averaging many CoTs aligned by the first answer position reveals a clear confidence spike, consistent across data sources.
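
The event-locked averaging described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `event_locked_average`, the `half_window` parameter, and the toy traces are all assumptions made for the example, and the per-token signal stands in for any scalar such as token confidence.

```python
def event_locked_average(signals, event_positions, half_window):
    """Average per-token signals after aligning each one to its event index.

    signals: list of per-token float sequences (one per CoT).
    event_positions: index of the first final-answer token in each sequence.
    half_window: number of offsets kept on each side of the event.
    Returns (offsets, means); an offset with no data yields None.
    """
    offsets = list(range(-half_window, half_window + 1))
    sums = [0.0] * len(offsets)
    counts = [0] * len(offsets)
    for signal, event in zip(signals, event_positions):
        for slot, offset in enumerate(offsets):
            index = event + offset
            if 0 <= index < len(signal):  # skip offsets outside this CoT
                sums[slot] += signal[index]
                counts[slot] += 1
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return offsets, means


# Toy data (hypothetical): three traces that peak at different answer positions.
traces = [
    [0.2, 0.3, 0.9, 0.4, 0.1],
    [0.5, 0.1, 0.2, 0.8, 0.3],
    [0.1, 1.0, 0.3, 0.2, 0.4],
]
answer_positions = [2, 3, 1]
offsets, means = event_locked_average(traces, answer_positions, half_window=1)
```

Without alignment the three peaks fall at different token indices and would blur under a plain average; after alignment they coincide at offset 0, which is why the spike only becomes visible once the answer position is known.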

While this averaged signal clearly marks the answer position, using it directly during online inference is not straightforward: it requires multiple CoTs and knowledge of the answer position, which is the very thing we are trying to predict. This motivates a learned approach: training a probe on the LRM's hidden states to predict, token by token, whether the final answer has already been generated.

Event-Related Potentials for LRMs

The averaged result in Figure 2 is reminiscent of event-related potential (ERP) research. An ERP is a measurable brain response elicited by a sensory, cognitive, or motor event, captured by electroencephalogram (EEG) recordings (Luck, 2014). Because individual EEG recordings are often noisy, ERPs are estimated using time-locked statistical estimators (e.g., averaging) across multiple EEG trials. While we do not claim that our findings align exactly with ERP research, it is notable that meaningful and observable signals can be extracted from LRMs using similar approaches, and we believe this warrants further exploration in future work.


Key Results

Performance Analysis

Figure 3 shows that Terminator defines the Pareto frontier on 14 of the 16 (LRM, benchmark) pairings, outperforming prior work. The plot covers four reasoning models (Qwen3-8B, Qwen3-14B, Ministral-3-8B-Reasoning-2512, Ministral-3-14B-Reasoning-2512) and four benchmarks (MATH-500, AIME 2025, HumanEval, GPQA). The full underlying numbers are reported in our paper.

Pareto frontiers of accuracy versus compression rate across reasoning models and benchmarks
Figure 3: Pareto Frontier. Terminator defines the Pareto frontier on 14 of the 16 (LRM, benchmark) pairings, outperforming prior work. Each point represents a method's accuracy and compression rate; lower compression rates indicate greater token savings and hence less compute. The dashed line traces the Pareto frontier connecting non-dominated solutions.
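
As a rough illustration of what "non-dominated" means here, the sketch below computes a Pareto frontier from (accuracy, compression rate) pairs. The helper name `pareto_frontier` is hypothetical, the numbers are the Qwen3-8B MATH-500 entries from Table 1, and every row (including the Vanilla and NoThinking references) is treated as a candidate point, which may differ from how the figure itself is drawn.

```python
def pareto_frontier(points):
    """Return the non-dominated methods, sorted by compression rate.

    points: dict mapping method name -> (accuracy, compression_rate).
    Higher accuracy is better; lower compression rate is better.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both axes and strictly better on one.
        return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

    frontier = [
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]
    return sorted(frontier, key=lambda name: points[name][1])


# Qwen3-8B on MATH-500: (accuracy %, compression rate %).
qwen3_8b_math500 = {
    "Vanilla": (91.1, 100.0),
    "NoThinking": (80.7, 16.1),
    "DEER": (79.9, 52.0),
    "Thought-Calib": (90.1, 93.9),
    "Dynasor": (78.3, 41.0),
    "Terminator": (90.7, 45.1),
}
frontier = pareto_frontier(qwen3_8b_math500)
```

Under these assumptions the frontier consists of NoThinking, Terminator, and Vanilla: each of the other methods is matched or beaten on both axes by at least one of these three.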
Table 1: Performance of Terminator and Baselines. ↑ = higher is better, ↓ = lower is better. CR = compression rate (mean per-sample). Tok = mean tokens per sample. Bold and underlined values highlight the best and second-best early exit methods. Terminator demonstrates superior accuracy-efficiency trade-offs (best or second-best performance across 28 out of 32 metrics).
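Benchmarks by domain: Math (MATH-500, AIME 2025), Coding (HumanEval), Science (GPQA).

| Model | Method | MATH-500 Acc↑ | MATH-500 Tok↓ | MATH-500 CR↓ | AIME 2025 Acc↑ | AIME 2025 Tok↓ | AIME 2025 CR↓ | HumanEval Acc↑ | HumanEval Tok↓ | HumanEval CR↓ | GPQA Acc↑ | GPQA Tok↓ | GPQA CR↓ | Overall Acc↑ | Overall CR↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B | Vanilla | 91.1% | 5,037 | 100% | 74.4% | 14,499 | 100% | 94.9% | 3,792 | 100% | 58.0% | 8,594 | 100% | 79.6% | 100% |
| Qwen3-8B | NoThinking | 80.7% | 809 | 16.1% | 22.0% | 2,355 | 18.6% | 84.6% | 353 | 11.8% | 46.0% | 1,204 | 15.8% | 58.3% | 15.6% |
| Qwen3-8B | DEER | 79.9% | 2,602 | 52.0% | 21.4% | 10,349 | 67.8% | 93.7% | 3,275 | 83.6% | 50.3% | 8,553 | 99.6% | 61.3% | 75.8% |
| Qwen3-8B | Thought-Calib | 90.1% | 4,372 | 93.9% | 65.8% | 11,014 | 81.5% | 93.9% | 3,267 | 92.9% | 56.6% | 6,240 | 78.9% | 76.6% | 86.8% |
| Qwen3-8B | Dynasor | 78.3% | 1,850 | 41.0% | 48.0% | 7,479 | 48.8% | 94.5% | 2,883 | 78.4% | 41.7% | 2,455 | 28.4% | 65.6% | 49.2% |
| Qwen3-8B | Terminator | 90.7% | 2,425 | 45.1% | 69.4% | 10,970 | 70.7% | 95.7% | 2,716 | 69.9% | 55.7% | 7,543 | 85.7% | 77.9% | 67.8% |
| Qwen3-14B | Vanilla | 92.0% | 4,598 | 100% | 79.9% | 14,255 | 100% | 96.9% | 3,296 | 100% | 60.2% | 7,628 | 100% | 82.3% | 100% |
| Qwen3-14B | NoThinking | 84.1% | 786 | 17.5% | 26.3% | 2,472 | 19.9% | 83.7% | 317 | 12.2% | 49.8% | 1,265 | 18.8% | 61.0% | 17.1% |
| Qwen3-14B | DEER | 80.9% | 2,501 | 56.2% | 27.6% | 10,497 | 71.0% | 96.9% | 2,961 | 87.3% | 52.0% | 7,451 | 97.4% | 64.5% | 78.0% |
| Qwen3-14B | Thought-Calib | 89.8% | 3,778 | 92.0% | 63.3% | 9,429 | 71.3% | 94.3% | 2,582 | 87.1% | 57.3% | 5,757 | 81.9% | 76.2% | 83.1% |
| Qwen3-14B | Dynasor | 79.6% | 1,702 | 42.4% | 61.8% | 7,937 | 52.8% | 96.5% | 2,611 | 82.2% | 45.7% | 2,101 | 29.1% | 70.9% | 51.6% |
| Qwen3-14B | Terminator | 90.7% | 2,261 | 46.8% | 74.2% | 10,787 | 71.0% | 97.1% | 2,358 | 70.9% | 59.6% | 6,798 | 87.1% | 80.4% | 68.9% |
| Ministral-3-8B-Reasoning-2512 | Vanilla | 93.5% | 6,212 | 100% | 92.6% | 22,124 | 100% | 97.1% | 4,367 | 100% | 63.4% | 11,765 | 100% | 86.6% | 100% |
| Ministral-3-8B-Reasoning-2512 | NoThinking | 83.2% | 1,908 | 28.1% | 43.6% | 7,711 | 36.5% | 87.6% | 727 | 16.4% | 42.4% | 2,106 | 16.0% | 64.2% | 24.3% |
| Ministral-3-8B-Reasoning-2512 | DEER | 71.0% | 3,791 | 60.3% | 67.1% | 17,481 | 77.0% | 80.1% | 3,606 | 84.0% | 61.9% | 11,312 | 94.1% | 70.0% | 78.9% |
| Ministral-3-8B-Reasoning-2512 | Thought-Calib | 87.7% | 5,695 | 87.8% | 83.7% | 20,358 | 91.2% | 57.9% | 3,536 | 87.2% | 50.6% | 7,406 | 71.8% | 70.0% | 84.5% |
| Ministral-3-8B-Reasoning-2512 | Dynasor | 88.1% | 2,967 | 56.8% | 87.6% | 15,407 | 66.3% | 96.9% | 3,931 | 88.6% | 57.7% | 9,766 | 83.7% | 82.6% | 73.9% |
| Ministral-3-8B-Reasoning-2512 | Terminator | 89.1% | 2,863 | 47.8% | 79.1% | 15,748 | 67.8% | 96.5% | 2,960 | 66.6% | 58.2% | 9,588 | 77.4% | 80.7% | 64.9% |
| Ministral-3-14B-Reasoning-2512 | Vanilla | 93.0% | 6,385 | 100% | 88.1% | 23,694 | 100% | 97.5% | 3,918 | 100% | 62.9% | 9,539 | 100% | 83.4% | 100% |
| Ministral-3-14B-Reasoning-2512 | NoThinking | 79.1% | 535 | 11.7% | 20.5% | 2,413 | 13.8% | 88.5% | 528 | 14.1% | 42.8% | 570 | 6.6% | 57.75% | 11.5% |
| Ministral-3-14B-Reasoning-2512 | DEER | 69.8% | 4,279 | 74.6% | 55.9% | 20,049 | 84.9% | 20.5% | 1,684 | 46.5% | 58.1% | 9,185 | 95.0% | 51.1% | 75.3% |
| Ministral-3-14B-Reasoning-2512 | Thought-Calib | 87.3% | 5,860 | 95.9% | 59.4% | 17,763 | 79.8% | 45.3% | 3,465 | 96.3% | 56.2% | 6,028 | 73.8% | 62.1% | 86.5% |
| Ministral-3-14B-Reasoning-2512 | Dynasor | 86.3% | 3,240 | 55.5% | 83.2% | 17,920 | 70.6% | 94.7% | 3,538 | 88.9% | 55.9% | 7,917 | 85.2% | 80.0% | 75.1% |
| Ministral-3-14B-Reasoning-2512 | Terminator | 90.2% | 2,946 | 43.9% | 84.2% | 15,518 | 63.5% | 97.8% | 2,903 | 71.0% | 61.2% | 7,727 | 76.5% | 83.4% | 63.7% |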
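
Table 1 reports CR as a mean over per-sample values, so dividing one Tok entry by the Vanilla Tok entry need not reproduce the CR column. The sketch below contrasts the two aggregations on made-up token counts. It assumes that a sample's compression rate is the method's token count divided by the vanilla token count for the same problem; the table caption only states "mean per-sample", so this definition and both function names are assumptions.

```python
def mean_per_sample_cr(method_tokens, vanilla_tokens):
    """Average of per-problem ratios (the aggregation named in the caption)."""
    ratios = [m / v for m, v in zip(method_tokens, vanilla_tokens)]
    return sum(ratios) / len(ratios)


def ratio_of_means(method_tokens, vanilla_tokens):
    """Ratio of mean token counts, i.e. what dividing two Tok columns gives."""
    return sum(method_tokens) / sum(vanilla_tokens)


# Hypothetical token counts for two problems: one short CoT, one long CoT.
vanilla = [1000, 9000]
method = [500, 900]

per_sample = mean_per_sample_cr(method, vanilla)  # (0.5 + 0.1) / 2
pooled = ratio_of_means(method, vanilla)          # 1400 / 10000
```

The pooled ratio is dominated by the long CoT, while the per-sample mean weights every problem equally, so the two can differ noticeably on datasets with a wide spread of CoT lengths.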

Latency Analysis

We benchmark Terminator's latency and throughput using a vLLM-compatible implementation on MATH-500 (batch size 1, single GH200). Terminator more than halves the average latency while reducing throughput by only 10.8% for Qwen3-8B and 7.5% for Qwen3-14B. Because Terminator's architecture (a single transformer layer and an FFN) stays fixed, its relative overhead shrinks as the base LRM grows.

Table 2: Latency Analysis. Latency and throughput benchmarks on MATH-500 (batch size 1) for Qwen3-based vanilla and Terminator models. Values are mean ± 95% CI.
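| Model | Method | Latency (s) | Throughput (tok/s) |
| --- | --- | --- | --- |
| Qwen3-8B | Vanilla | 32.68 ± 9.59 | 151.5 ± 4.4 |
| Qwen3-8B | Terminator | 14.10 ± 6.27 | 135.2 ± 2.0 |
| Qwen3-14B | Vanilla | 43.38 ± 13.98 | 98.0 ± 2.0 |
| Qwen3-14B | Terminator | 18.76 ± 6.52 | 90.6 ± 0.8 |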
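
The headline numbers can be re-derived from the mean values in Table 2. The sketch below assumes that "throughput overhead" means the relative drop in tokens per second with respect to the vanilla model; the function name `relative_changes` is hypothetical, and confidence intervals are ignored.

```python
# (mean latency in seconds, mean throughput in tok/s) from Table 2.
table2 = {
    "Qwen3-8B": {"Vanilla": (32.68, 151.5), "Terminator": (14.10, 135.2)},
    "Qwen3-14B": {"Vanilla": (43.38, 98.0), "Terminator": (18.76, 90.6)},
}


def relative_changes(rows):
    """Return (latency ratio, relative throughput drop) of Terminator vs. Vanilla."""
    (v_latency, v_throughput) = rows["Vanilla"]
    (t_latency, t_throughput) = rows["Terminator"]
    latency_ratio = t_latency / v_latency
    throughput_overhead = 1.0 - t_throughput / v_throughput
    return latency_ratio, throughput_overhead


summary = {model: relative_changes(rows) for model, rows in table2.items()}
for model, (latency_ratio, overhead) in summary.items():
    print(f"{model}: latency x{latency_ratio:.2f}, throughput overhead {overhead:.1%}")
```

From the rounded table entries this gives a latency ratio of about 0.43 for both models and throughput drops of roughly 10.8% and 7.6%, consistent with the quoted 10.8% and 7.5% up to rounding of the table values.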

Hindsight-Optimal Reasoning Length

Since Terminator is trained on CoTs labeled with their hindsight-optimal reasoning lengths, it is natural to ask where Terminator lies on the accuracy-compression frontier relative to the ground-truth hindsight-optimal reasoning length. Figure 4 traces test-set accuracy as CoTs are truncated at increasingly early points, forcing the LRM to commit to a final solution and answer. The diamond-shaped markers denote the position where the final answer first appears (i.e., the hindsight-optimal reasoning length). Accuracy stays essentially constant past this point, confirming that reasoning beyond the first answer yields no further gains. Although no online method can achieve the hindsight-optimal reasoning length in principle, Terminator lands close to it across all four datasets.

Accuracy versus compression rate as CoTs are truncated early
Figure 4: Effects of Early Reasoning Termination. Test-set CoTs are evaluated after truncating them at various points via the end-of-thought token and asking the LRM for a final solution and answer. Diamond-shaped markers show the hindsight-optimal answer position (not achievable by any online method); Terminator lands close to optimality on all four datasets.
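
The truncation step behind this experiment can be sketched as follows. This is only an illustration: `truncate_cot` and `keep_fraction` are hypothetical names, the `</think>` marker is the Qwen3-style end-of-thought token (other LRMs use different markers), and the follow-up step of feeding each truncated sequence back to the LRM to elicit a final solution is omitted.

```python
END_OF_THOUGHT = "</think>"  # assumption: Qwen3-style marker; other LRMs differ


def truncate_cot(cot_tokens, keep_fraction):
    """Keep the first `keep_fraction` of the CoT, then close the thinking block."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in [0, 1]")
    keep = int(keep_fraction * len(cot_tokens))
    return cot_tokens[:keep] + [END_OF_THOUGHT]


# Hypothetical 8-token CoT, truncated at three different points.
cot = ["step1", "step2", "answer", "recheck", "recheck2", "recheck3", "done", "ok"]
sweep = {fraction: truncate_cot(cot, fraction) for fraction in (0.25, 0.5, 1.0)}
```

Evaluating accuracy for each entry of such a sweep would yield one point per truncation level on an accuracy-versus-compression curve like the one in Figure 4.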

Method

Terminator is a single-layer transformer that sits on top of a base LRM and injects the end-of-thought token once the LRM has generated its final answer. It takes the LRM's final hidden states as input and, at each position, predicts 0 if the answer has not yet been generated and 1 if it has.

Method diagram
Figure 5: Terminator Architecture. Terminator predicts when the LRM has generated its final answer.
Training Details

Terminator is trained to predict the exact position of the first occurrence of the final answer of an LRM's CoT during online inference. An important challenge we overcome is to identify the exact position of the earliest final answer occurrence for a given CoT. Our training dataset is curated using the following pipeline for robust extraction of the answer position:

  1. Collect CoTs from the LRM.
  2. For each CoT, use a separate LRM to locate the final answer and extract its exact position.
  3. Given the token index of the answer, construct the labels: set all positions before the answer to 0 and all positions after it to 1.

Finally, Terminator is trained with binary cross-entropy loss over each position in the CoT. By using the above training strategy, Terminator learns to exit when the LRM has generated the final answer it will eventually commit to.
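
The label construction and loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the training code: the function names are hypothetical, and the answer position itself is labeled 1 here, a choice the text above does not specify.

```python
import math


def make_labels(num_tokens, answer_index):
    """0 before the first final-answer token, 1 from that token onward.

    Assumption: the answer token itself gets label 1.
    """
    return [0 if i < answer_index else 1 for i in range(num_tokens)]


def binary_cross_entropy(probabilities, labels, eps=1e-7):
    """Mean binary cross-entropy over all positions of one CoT."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = make_labels(num_tokens=6, answer_index=3)
good_loss = binary_cross_entropy([0.1, 0.1, 0.2, 0.8, 0.9, 0.9], labels)
bad_loss = binary_cross_entropy([0.9, 0.9, 0.8, 0.2, 0.1, 0.1], labels)
```

A probe whose probabilities rise at the answer position incurs a much lower loss than one whose probabilities fall there, which is the behavior the per-position objective rewards.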

Inference Details

At inference time, Terminator makes a prediction for each token generated by the LRM. It forces the end-of-thought token, terminating reasoning early, according to the following rules:

  • Predict 1 if the predicted confidence is above a fixed threshold (set to 0.7 by default)
  • Count the number of predicted 1s in a sliding window of the most recent predictions (the window size is 10 by default)
  • If the majority (>5 if the window size is 10) of predictions are 1, then the CoT is terminated
Please see the appendix of our paper for further details on the inference procedure and ablations on the threshold and window size hyperparameters.
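
The three rules above can be sketched as a small streaming check. This is an illustrative reading of the rules rather than the released implementation: `find_exit_position` is a hypothetical name, and the sketch assumes that a partially filled window is held to the same count as a full one, so no exit can occur before more than half a window's worth of 1s has been seen.

```python
from collections import deque


def find_exit_position(probabilities, threshold=0.7, window_size=10):
    """Return the token index at which reasoning would be terminated, or None."""
    window = deque(maxlen=window_size)  # most recent binary predictions
    for position, probability in enumerate(probabilities):
        window.append(1 if probability > threshold else 0)
        if sum(window) > window_size // 2:  # majority of the window predicts 1
            return position
    return None


# Hypothetical probability streams from the probe.
clean_transition = [0.1] * 20 + [0.9] * 10
isolated_spikes = [0.9, 0.1] * 15

exit_clean = find_exit_position(clean_transition)
exit_spikes = find_exit_position(isolated_spikes)
```

With the defaults, a clean transition at token 20 triggers the exit at token 25, once six of the last ten predictions are 1, while a stream of isolated spikes never reaches a majority. The window therefore trades a few tokens of delay for robustness to single noisy predictions.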

Additional Results

Below are some additional results that we wish to highlight.

Thinking Token Frequency Shift

"Thinking tokens" such as wait, hmm, okay, and alternatively are associated with ongoing reasoning. We find that they exhibit measurable frequency shifts once the final answer is generated. The plots below show token rates before vs. after the answer for three such tokens. For example, hmm and okay occur more frequently before the answer, while the third token occurs more frequently after it. Dot size reflects relative CoT length. Similar plots with different "thinking tokens" are shown in the appendix of our paper.

Token usage frequency shift before and after final answer
Figure 6: Token Usage Frequency Shift. "Thinking token" rates change depending on whether the final answer has been generated. Dot size reflects relative CoT length.
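
One way to compute such before/after rates for a single CoT is sketched below. The normalization is an assumption: a rate is taken to be the number of occurrences divided by the number of tokens in the segment, which the text above does not specify, and `token_rates` is a hypothetical helper operating on a toy token list.

```python
def token_rates(cot_tokens, answer_index, token):
    """Occurrences of `token` per generated token, before vs. after the answer."""
    before = cot_tokens[:answer_index]
    after = cot_tokens[answer_index:]

    def rate(segment):
        return segment.count(token) / len(segment) if segment else 0.0

    return rate(before), rate(after)


# Hypothetical CoT whose final answer first appears at index 4.
cot = ["hmm", "x", "hmm", "y", "ans", "wait", "z", "wait"]
hmm_before, hmm_after = token_rates(cot, answer_index=4, token="hmm")
wait_before, wait_after = token_rates(cot, answer_index=4, token="wait")
```

Each CoT contributes one (before, after) pair per token, and plotting these pairs against each other gives one dot per CoT in a scatter plot of the kind shown in Figure 6.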
Terminator Recovers Early-Exit Signals

Our motivation (Figure 2) showed that the first arrival of the final answer is marked by a confidence spike under event-locked averaging, and by a shift in "thinking token" usage. A natural question is whether Terminator, trained only on the LRM's hidden states, recovers these same signals using its own predicted exit positions, rather than the ground-truth answer positions.

Figure 7 mirrors the event-locked averaging of Figure 2, but uses all test samples rather than 3,200 randomly selected training samples. Its left and center panels show the averaged Token-Confidence using ground-truth and Terminator-predicted answer positions, respectively. The two are nearly identical, and the right panel shows that the prediction errors are tightly concentrated near zero, with a median difference of 7 tokens.

Event-locked averaging using ground-truth versus Terminator-predicted exit positions
Figure 7: Terminator Recovers Event-Locked Average Spiking. The exit positions predicted by Terminator (center) recover spiking behavior in the event-locked averaged Token-Confidence similar to the ground-truth answer positions (left). The histogram of differences between exit positions (right, log-scaled y-axis) shows Terminator's predictions are close to ground-truth.

Figure 8 parallels the "thinking token" analysis by overlaying scatter plots computed from ground-truth and predicted answer positions. The inset axes show nearly identical above-diagonal percentages, indicating that Terminator preserves the same before/after token usage biases. Together, these results show that training Terminator on the LRM's hidden states alone is sufficient to independently recover the early-exit signals that motivate our approach.

Thinking-token scatter plots using ground-truth versus Terminator-predicted exit positions
Figure 8: Terminator Token Usage Biases. The exit positions predicted by Terminator recover the same biases in "thinking token" occurrence rates as the ground-truth answer positions. The inset axes show the percentage of dots above the diagonal for ground-truth and Terminator answer positions.
Out-of-Distribution Generalization

Our main results train Terminator on a mix of all four training tasks. To assess out-of-distribution (OOD) generalization, we instead train a separate probe on each individual task and evaluate it on every test set. Figure 9 reports compression rate (left) and accuracy (right), with rows denoting the training dataset and columns the test dataset.

Compression is best in-distribution (along the diagonal), but accuracy does not always follow the same pattern. For example, training on OpenScience yields the lowest GPQA accuracy despite GPQA being in-distribution, whereas training on the OOD OpenCoder-SFT data improves GPQA accuracy but worsens its compression to 96%. In short, OOD evaluation can slightly improve accuracy but often delays exiting, reducing token savings on data not seen during training.

OOD compression rate and accuracy heatmaps across training and evaluation datasets
Figure 9: OOD Performance of Terminator. The best accuracy-compression trade-off is achieved when the evaluation set is in-distribution with the training set. Compression rate (left) and accuracy (right) for Qwen3-8B; training datasets along the rows, evaluation sets along the columns. Each training dataset has an in-domain evaluation set: MATH→MATH-500, AIME 1983–2024→AIME 2025, OpenCoder-SFT→HumanEval, OpenScience→GPQA.
Example 1: MATH-500

For easier problems like those in MATH-500, Terminator shows sharp transitions in predicted confidence at the exiting threshold, with good separation between the "still reasoning" and "answer generated" regimes. By manual inspection, we observe that there is a clearer transition to overthinking on these problems, which Terminator detects well.

Predicted probabilities for four MATH-500 samples
Figure 10: Predicted Probabilities for MATH-500. Terminator's predicted probability stream for early-exiting on four randomly chosen samples from MATH-500.

Figure 11 zooms in on a single MATH-500 sample, showing the full CoT with Terminator's predicted probabilities overlaid. The predicted probability remains low throughout the reasoning phase and rises sharply once the final answer has been generated, triggering the early exit.

Predicted probabilities for a single MATH-500 sample
Figure 11: Predicted Probabilities for MATH-500. Terminator's predicted probabilities for early-exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.
Example 2: AIME 2025

For harder problems like those in AIME 2025, the transition from productive reasoning to overthinking is less obvious, and Terminator's predicted probabilities do not show the same sharp transition seen on MATH-500. This is consistent with the observation that identifying a good exit position is more challenging on very hard tasks.

Predicted probabilities for four AIME 2025 samples
Figure 12: Predicted Probabilities for AIME 2025. Terminator's predicted probability stream for early-exiting on four randomly chosen samples from AIME 2025.

Figure 13 shows a case where Terminator struggles to find a clean exit position. The predicted probabilities oscillate near the threshold rather than making a decisive jump, reflecting the difficulty of the underlying problem. Despite this, Terminator still achieves strong performance on AIME 2025 overall.

Predicted probabilities for a single AIME 2025 sample
Figure 13: Predicted Probabilities for AIME 2025. Terminator's predicted probabilities for early-exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.

Demo 🚀

Try Terminator yourself! We provide ready-to-run model packages on HuggingFace with a high-performance vLLM server and a standalone inference script.

Available Models

| Model | Base Model | Min. VRAM |
| --- | --- | --- |
| Terminator-Qwen3-8B | Qwen3-8B | ~24 GB |
| Terminator-Qwen3-14B | Qwen3-14B | ~40 GB |

Quick Start

# Clone a model repo (requires Git LFS: https://git-lfs.com)
git lfs install
git clone https://huggingface.co/acnagle/Terminator-Qwen3-8B
cd Terminator-Qwen3-8B

# Automated setup (creates conda env, installs vLLM, downloads base model)
./setup.sh

# Start the vLLM server
./start_server.sh

# In another terminal, chat with the model
python client.py --interactive
Standalone Inference (No Server)

For quick testing without spinning up a vLLM server, use the HuggingFace-native inference script included in each model repo:

python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"

Thinking content is streamed in dimmed text; the final answer is shown in bold. For best performance, the vLLM server is recommended.

Using the API Directly

The vLLM server exposes an OpenAI-compatible API, so you can use any OpenAI client library:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Terminator-Qwen3-8B",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

# Thinking content (chain-of-thought)
print(response.choices[0].message.reasoning)

# Final answer
print(response.choices[0].message.content)

Citation

@misc{nagle2026terminatorlearningoptimalexit,
  title   = {TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning},
  author  = {Alliot Nagle and Jakhongir Saydaliev and Dhia Garbaya and Michael Gastpar and Ashok Vardhan Makkuva and Hyeji Kim},
  year    = {2026},
  eprint  = {2603.12529},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url     = {https://arxiv.org/abs/2603.12529}
}