Terminator: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning

Alliot Nagle¹, Jakhongir Saydaliev², Dhia Garbaya³, Michael Gastpar²,†, Ashok Vardhan Makkuva⁴,†, Hyeji Kim¹,†

¹UT Austin ²EPFL ³ENS Paris-Saclay ⁴Télécom Paris

TL;DR: Terminator cuts LLM reasoning length by 14%–55%, halves wall-clock latency, and achieves best or second-best performance on 28 of 32 metrics across four models and four benchmarks, all without fine-tuning the base model.

We introduce hindsight-optimal reasoning length, a novel framework for determining where an LRM should stop thinking.
Terminator is a lightweight single-layer probe that sits on top of any LRM. No fine-tuning of the base model is required.
Best or second-best on 28 of 32 metrics across four models (Qwen3-8B/14B, Ministral-8B/14B) and four benchmarks (MATH-500, AIME 2025, HumanEval, GPQA).
2× faster wall-clock inference on average with only ~10% throughput overhead.

Figure 1: Early Stopping via Terminator. Side-by-side comparison of Terminator-Qwen3-8B (left) vs. vanilla Qwen3-8B (right) on the following MATH-500 question: "Define $p = \sum_{k=1}^\infty \frac{1}{k^2}$ and $q = \sum_{k=1}^\infty \frac{1}{k^3}.$ Find a way to write $\sum_{j=1}^\infty \sum_{k=1}^\infty \frac{1}{(j+k)^3}$ in terms of $p$ and $q.$"

Abstract

Large Reasoning Models (LRMs) achieve impressive performance on complex reasoning tasks via Chain-of-Thought (CoT) reasoning, which enables them to generate intermediate thinking tokens before arriving at the final answer. However, LRMs often suffer from significant overthinking, spending excessive compute even after the answer has been generated early on. Prior work has identified the existence of an optimal reasoning length such that truncating reasoning at this point significantly shortens CoT outputs with virtually no change in performance. However, determining optimal CoT lengths for practical datasets is highly non-trivial, as they are fully task- and model-dependent. We address precisely this problem and design Terminator, an early-exit strategy for LRMs at inference time that mitigates overthinking. The central idea underpinning Terminator is that the first arrival of an LRM's final answer is often predictable, and we leverage these first-answer positions to create a novel dataset of optimal reasoning lengths to train Terminator. Powered by this approach, Terminator achieves significant reductions in CoT lengths of 14%–55% on average across four challenging practical datasets (MATH-500, AIME 2025, HumanEval, and GPQA), whilst outperforming current state-of-the-art methods.

Motivation

Large Reasoning Models (LRMs) frequently overthink: even after generating the correct final answer early in their Chain-of-Thought, they continue to reason for hundreds or thousands of additional tokens, double-checking, exploring alternatives, and sometimes even changing to a wrong answer.

A natural question follows: can we detect when the LRM has already generated its final answer? We find that the answer is yes. By aligning many CoTs to the position where the final answer first appears and averaging, a clear confidence spike emerges at exactly that position. Individual CoT signals are noisy (Figure 2, left), but event-locked averaging across 3,200 CoTs from math, science, and coding problems reveals a consistent and sharp signal (Figure 2, right).

Event-locked averaging of Token-Confidence and log-probability signals

Figure 2: Event-Locked Averaging of Token-Confidence. Left: single-sample trajectories are noisy. Right: averaging many CoTs aligned by the first answer position reveals a clear confidence spike, consistent across data sources.
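
The event-locked averaging described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `event_locked_average`, the `half_window` parameter, and the toy traces are all assumptions made for the example. Each per-token signal is aligned at its first-answer index, and values at the same offset from that index are averaged across CoTs.

```python
from statistics import fmean


def event_locked_average(signals, event_positions, half_window):
    """Average per-token signals after aligning each one at its event index.

    signals: list of per-token value lists (one list per CoT).
    event_positions: index of the first answer token in each CoT.
    Returns 2 * half_window + 1 values; entry i is the mean signal at
    offset (i - half_window) from the event, over CoTs covering that offset.
    """
    averaged = []
    for offset in range(-half_window, half_window + 1):
        values = [
            sig[pos + offset]
            for sig, pos in zip(signals, event_positions)
            if 0 <= pos + offset < len(sig)
        ]
        averaged.append(fmean(values) if values else float("nan"))
    return averaged


# Toy traces (made up): the spike sits at a different index in each CoT.
signals = [
    [0.2, 0.1, 0.9, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.2, 0.8, 0.1],
    [0.3, 1.0, 0.1, 0.3, 0.2, 0.1],
]
event_positions = [2, 4, 1]
profile = event_locked_average(signals, event_positions, half_window=1)
print(profile)  # the middle entry (offset 0) is the largest
```

Averaging the raw traces index by index would smear the three spikes across positions 1, 2, and 4; aligning them first is what makes the spike visible at offset 0.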

While this averaged signal clearly marks the answer position, using it directly during online inference is not straightforward: it requires multiple CoTs and knowledge of the answer position, which is the very thing we are trying to predict. This motivates a learned approach: training a probe on the LRM's hidden states to predict, token by token, whether the final answer has already been generated.

Event-Related Potentials for LRMs

The averaged result in Figure 2 is reminiscent of event-related potential (ERP) research. An ERP is a measurable brain response elicited by a sensory, cognitive, or motor event, captured by electroencephalogram (EEG) recordings (Luck, 2014). However, EEG recordings are often noisy, so ERPs are estimated using time-locked statistical estimators (e.g., averaging) across multiple EEG trials. While we do not claim that our findings align exactly with ERP research, it is notable that meaningful, observable signals can be extracted from LRMs using similar approaches, and we believe this warrants further exploration in future work.

Key Results

Performance Analysis

Figure 3 shows that Terminator defines the Pareto frontier on 14 of the 16 (LRM, benchmark) pairings, outperforming prior work. The plot covers four reasoning models (Qwen3-8B, Qwen3-14B, Ministral-3-8B-Reasoning-2512, Ministral-3-14B-Reasoning-2512) and four benchmarks (MATH-500, AIME 2025, HumanEval, GPQA). The full underlying numbers are reported in our paper.

Pareto frontiers of accuracy versus compression rate across reasoning models and benchmarks

Figure 3: Pareto Frontier. Terminator defines the Pareto frontier on 14 of the 16 (LRM, benchmark) pairings, outperforming prior work. Each point represents a method's accuracy and compression rate, with lower compression rates indicating greater token savings and hence less compute. The dashed line traces the Pareto frontier connecting non-dominated solutions.
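
To make "non-dominated" concrete, here is a small sketch that computes a Pareto frontier over (compression rate, accuracy) pairs. The function `pareto_frontier` is a hypothetical helper written for this page, and the numbers are the Qwen3-8B MATH-500 entries from Table 1. Whether the paper's plots include the Vanilla and NoThinking points in the frontier is an assumption here.

```python
def pareto_frontier(points):
    """Return the names of non-dominated methods, sorted by compression rate.

    points: dict mapping name -> (compression_rate, accuracy), where a lower
    compression rate and a higher accuracy are both better. A method is
    dominated if another method is at least as good on both axes and
    strictly better on at least one.
    """
    frontier = []
    for name, (cr, acc) in points.items():
        dominated = any(
            o_cr <= cr and o_acc >= acc and (o_cr < cr or o_acc > acc)
            for other, (o_cr, o_acc) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# (CR %, Acc %) for Qwen3-8B on MATH-500, copied from Table 1.
qwen3_8b_math500 = {
    "Vanilla": (100.0, 91.1),
    "NoThinking": (16.1, 80.7),
    "DEER": (52.0, 79.9),
    "Thought-Calib": (93.9, 90.1),
    "Dynasor": (41.0, 78.3),
    "Terminator": (45.1, 90.7),
}
frontier = pareto_frontier(qwen3_8b_math500)
print(frontier)  # -> ['NoThinking', 'Terminator', 'Vanilla']
```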

Table 1: Performance of Terminator and Baselines. ↑ = higher is better, ↓ = lower is better. CR = compression rate (mean per-sample). Tok = mean tokens per sample. Bold and underlined values highlight the best and second-best early exit methods. Terminator demonstrates superior accuracy-efficiency trade-offs (best or second-best performance across 28 out of 32 metrics).

Method	Math						Coding			Science			Overall
	MATH-500			AIME 2025			HumanEval			GPQA			Overall
	Acc↑	Tok↓	CR↓	Acc↑	Tok↓	CR↓	Acc↑	Tok↓	CR↓	Acc↑	Tok↓	CR↓	Acc↑	CR↓
Qwen3-8B
Vanilla	91.1%	5,037	100%	74.4%	14,499	100%	94.9%	3,792	100%	58.0%	8,594	100%	79.6%	100%
NoThinking	80.7%	809	16.1%	22.0%	2,355	18.6%	84.6%	353	11.8%	46.0%	1,204	15.8%	58.3%	15.6%
DEER	79.9%	2,602	52.0%	21.4%	10,349	67.8%	93.7%	3,275	83.6%	50.3%	8,553	99.6%	61.3%	75.8%
Thought-Calib	90.1%	4,372	93.9%	65.8%	11,014	81.5%	93.9%	3,267	92.9%	56.6%	6,240	78.9%	76.6%	86.8%
Dynasor	78.3%	1,850	41.0%	48.0%	7,479	48.8%	94.5%	2,883	78.4%	41.7%	2,455	28.4%	65.6%	49.2%
Terminator	90.7%	2,425	45.1%	69.4%	10,970	70.7%	95.7%	2,716	69.9%	55.7%	7,543	85.7%	77.9%	67.8%
Qwen3-14B
Vanilla	92.0%	4,598	100%	79.9%	14,255	100%	96.9%	3,296	100%	60.2%	7,628	100%	82.3%	100%
NoThinking	84.1%	786	17.5%	26.3%	2,472	19.9%	83.7%	317	12.2%	49.8%	1,265	18.8%	61.0%	17.1%
DEER	80.9%	2,501	56.2%	27.6%	10,497	71.0%	96.9%	2,961	87.3%	52.0%	7,451	97.4%	64.5%	78.0%
Thought-Calib	89.8%	3,778	92.0%	63.3%	9,429	71.3%	94.3%	2,582	87.1%	57.3%	5,757	81.9%	76.2%	83.1%
Dynasor	79.6%	1,702	42.4%	61.8%	7,937	52.8%	96.5%	2,611	82.2%	45.7%	2,101	29.1%	70.9%	51.6%
Terminator	90.7%	2,261	46.8%	74.2%	10,787	71.0%	97.1%	2,358	70.9%	59.6%	6,798	87.1%	80.4%	68.9%
Ministral-3-8B-Reasoning-2512
Vanilla	93.5%	6,212	100%	92.6%	22,124	100%	97.1%	4,367	100%	63.4%	11,765	100%	86.6%	100%
NoThinking	83.2%	1,908	28.1%	43.6%	7,711	36.5%	87.6%	727	16.4%	42.4%	2,106	16.0%	64.2%	24.3%
DEER	71.0%	3,791	60.3%	67.1%	17,481	77.0%	80.1%	3,606	84.0%	61.9%	11,312	94.1%	70.0%	78.9%
Thought-Calib	87.7%	5,695	87.8%	83.7%	20,358	91.2%	57.9%	3,536	87.2%	50.6%	7,406	71.8%	70.0%	84.5%
Dynasor	88.1%	2,967	56.8%	87.6%	15,407	66.3%	96.9%	3,931	88.6%	57.7%	9,766	83.7%	82.6%	73.9%
Terminator	89.1%	2,863	47.8%	79.1%	15,748	67.8%	96.5%	2,960	66.6%	58.2%	9,588	77.4%	80.7%	64.9%
Ministral-3-14B-Reasoning-2512
Vanilla	93.0%	6,385	100%	88.1%	23,694	100%	97.5%	3,918	100%	62.9%	9,539	100%	83.4%	100%
NoThinking	79.1%	535	11.7%	20.5%	2,413	13.8%	88.5%	528	14.1%	42.8%	570	6.6%	57.75%	11.5%
DEER	69.8%	4,279	74.6%	55.9%	20,049	84.9%	20.5%	1,684	46.5%	58.1%	9,185	95.0%	51.1%	75.3%
Thought-Calib	87.3%	5,860	95.9%	59.4%	17,763	79.8%	45.3%	3,465	96.3%	56.2%	6,028	73.8%	62.1%	86.5%
Dynasor	86.3%	3,240	55.5%	83.2%	17,920	70.6%	94.7%	3,538	88.9%	55.9%	7,917	85.2%	80.0%	75.1%
Terminator	90.2%	2,946	43.9%	84.2%	15,518	63.5%	97.8%	2,903	71.0%	61.2%	7,727	76.5%	83.4%	63.7%
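
Note that CR in Table 1 is a mean of per-sample rates, so it need not equal the ratio of the Tok columns. For example, for Terminator-Qwen3-8B on MATH-500, 2,425 / 5,037 ≈ 48.1%, while the reported CR is 45.1%. The sketch below illustrates the difference on made-up token counts. It assumes the per-sample rate is the method's token count divided by the vanilla token count for the same sample, which is our reading of the caption rather than a definition stated on this page.

```python
from statistics import fmean


def mean_per_sample_cr(method_tokens, vanilla_tokens):
    """Average of per-sample ratios (assumed definition of CR)."""
    return fmean(m / v for m, v in zip(method_tokens, vanilla_tokens))


def ratio_of_means_cr(method_tokens, vanilla_tokens):
    """Ratio of mean token counts, i.e. what dividing the Tok columns gives."""
    return fmean(method_tokens) / fmean(vanilla_tokens)


# Made-up counts: a short CoT compressed heavily, a long one compressed little.
vanilla = [1000, 9000]
method = [300, 8100]

per_sample = mean_per_sample_cr(method, vanilla)  # (0.30 + 0.90) / 2
pooled = ratio_of_means_cr(method, vanilla)       # 4200 / 5000
print(f"mean per-sample CR: {per_sample:.2f}, ratio of means: {pooled:.2f}")
```

The ratio of means is dominated by long samples, whereas the per-sample mean weights every problem equally.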

Latency Analysis

We benchmark Terminator's latency and throughput using a vLLM-compatible implementation on MATH-500 (batch size 1, single GH200). Terminator more than halves the average latency with only a small throughput overhead of 10.8% for Qwen3-8B and 7.5% for Qwen3-14B. Because Terminator's architecture (a single transformer layer and an FFN) remains fixed, the overhead shrinks proportionally as the base LRM grows.

Table 2: Latency Analysis. Latency and throughput benchmarks on MATH-500 (batch size 1) for Qwen3-based vanilla and Terminator models. Values are mean ± 95% CI.

Method	Latency (s)	Throughput (tok/s)
Qwen3-8B
Vanilla	32.68 ± 9.59	151.5 ± 4.4
Terminator	14.10 ± 6.27	135.2 ± 2.0
Qwen3-14B
Vanilla	43.38 ± 13.98	98.0 ± 2.0
Terminator	18.76 ± 6.52	90.6 ± 0.8
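
The speedup and overhead figures quoted above can be re-derived from the means in Table 2. The snippet below is a simple arithmetic check. The helper name `summarize` is ours, and results may differ from the quoted percentages in the last digit because the table entries are rounded.

```python
# Mean latency (s) and throughput (tok/s) from Table 2.
table2 = {
    "Qwen3-8B": {"vanilla": (32.68, 151.5), "terminator": (14.10, 135.2)},
    "Qwen3-14B": {"vanilla": (43.38, 98.0), "terminator": (18.76, 90.6)},
}


def summarize(rows):
    """Return (latency speedup, relative throughput overhead) for one model."""
    (v_lat, v_tput), (t_lat, t_tput) = rows["vanilla"], rows["terminator"]
    speedup = v_lat / t_lat            # > 2 means latency is more than halved
    overhead = 1.0 - t_tput / v_tput   # fraction of throughput lost
    return speedup, overhead


results = {model: summarize(rows) for model, rows in table2.items()}
for model, (speedup, overhead) in results.items():
    print(f"{model}: {speedup:.2f}x lower latency, {overhead:.1%} throughput overhead")
```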

Hindsight-Optimal Reasoning Length

Since Terminator is trained on hindsight-optimal reasoning length CoTs, it is natural to ask where Terminator lies on the accuracy-compression frontier relative to the ground-truth hindsight-optimal reasoning length. Figure 4 traces test-set accuracy as CoTs are truncated at increasingly early points, forcing the LRM to commit to a final solution and answer. The diamond-shaped markers denote the position where the final answer first appears (i.e., the hindsight-optimal reasoning length). Accuracy stays essentially constant past this point, confirming that reasoning beyond the first answer yields no further gains. Although the hindsight-optimal reasoning length baseline is not achievable by any online method in principle, Terminator lands close to it across all four datasets.

Accuracy versus compression rate as CoTs are truncated early

Figure 4: Effects of Early Reasoning Termination. Test-set CoTs are evaluated after truncating them at various points via the end-of-thought token and asking the LRM for a final solution and answer. Diamond-shaped markers show the hindsight-optimal answer position (not achievable by any online method); Terminator lands close to optimality on all four datasets.
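
As a rough sketch of the truncation procedure in Figure 4, the snippet below cuts a CoT at several fractions of its length and appends an end-of-thought marker, after which the LRM would be asked for its final solution. The helper `truncate_cot`, the toy token list, and the `</think>` marker (the marker used by Qwen3-style models) are assumptions for illustration; the actual evaluation code may differ.

```python
def truncate_cot(cot_tokens, fraction, end_of_thought="</think>"):
    """Keep the first `fraction` of the CoT tokens, then close the thinking block."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, int(fraction * len(cot_tokens)))
    return cot_tokens[:keep] + [end_of_thought]


# Toy CoT (made up): the answer "4" first appears at index 3.
cot = ["so", "x", "=", "4", "wait", "check", "again", "yes"]

# Sweep over truncation points; each result would be fed back to the LRM
# as a prefix from which it must produce its final answer.
sweep = {f: truncate_cot(cot, f) for f in (0.25, 0.5, 0.75, 1.0)}
for fraction, prefix in sweep.items():
    print(fraction, prefix)
```

In this toy case, truncating at 0.5 keeps exactly the tokens up to and including the first answer, which corresponds to the hindsight-optimal cut.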

Method

Terminator is a single-layer transformer that sits on top of a base LRM and injects the end-of-thought token when the LRM has generated its final answer. Terminator accepts the final hidden states of the LRM as input and predicts 0 if the answer has not been generated yet, or 1 if it has.

Figure 5: Terminator Architecture. Terminator predicts when the LRM has generated its final answer.

Training Details

Terminator is trained to predict, during online inference, the exact position at which the final answer first occurs in an LRM's CoT. A key challenge is identifying the exact position of this earliest final-answer occurrence for a given CoT. Our training dataset is curated using the following pipeline for robust extraction of the answer position:

Collect CoTs from the LRM
For each CoT, we use a separate LRM to analyze the CoT for the final answer and extract its exact position
Given the token position index of the answer, we construct the labels: set all positions before the answer to 0 and all positions after to 1

Finally, Terminator is trained with binary cross-entropy loss over each position in the CoT. By using the above training strategy, Terminator learns to exit when the LRM has generated the final answer it will eventually commit to.
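
The labeling and loss described above can be sketched as follows. This is a toy, framework-free illustration and not the training code: `make_labels` and `bce_loss` are hypothetical helpers, the probabilities are made up, and labeling the answer position itself as 1 is an assumption, since the text only specifies positions before and after it.

```python
import math


def make_labels(num_tokens, answer_pos):
    """0 for positions before the answer, 1 from the answer onward.

    Assumption: the answer position itself is labeled 1.
    """
    return [0 if i < answer_pos else 1 for i in range(num_tokens)]


def bce_loss(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over all positions of one CoT."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


labels = make_labels(num_tokens=6, answer_pos=3)
print(labels)  # -> [0, 0, 0, 1, 1, 1]

# A probe that switches at the right position gets a lower loss
# than one that fires too early.
on_time = bce_loss([0.1, 0.1, 0.2, 0.8, 0.9, 0.9], labels)
too_early = bce_loss([0.1, 0.8, 0.9, 0.9, 0.9, 0.9], labels)
print(on_time < too_early)  # -> True
```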

Inference Details

At inference time, Terminator makes a prediction for each token generated by the LRM. It forces the end-of-thought token, terminating reasoning early, according to the following rules:

Predict 1 if the predicted confidence is above a fixed threshold (set to 0.7 by default)
Count the number of predicted 1s in a sliding window of the most recent predictions (the window size is 10 by default)
If the majority (>5 if the window size is 10) of predictions are 1, then the CoT is terminated
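
The three rules above amount to a thresholded majority vote over a sliding window. Below is a minimal sketch using the stated defaults (threshold 0.7, window size 10). The function name `first_exit_index` is ours, and the handling of a window that is not yet full (the vote still needs more than `window_size / 2` ones) is an assumption, not something specified here.

```python
from collections import deque


def first_exit_index(confidences, threshold=0.7, window_size=10):
    """Return the token index at which reasoning would be terminated, or None."""
    window = deque(maxlen=window_size)  # most recent binary predictions
    for i, conf in enumerate(confidences):
        window.append(1 if conf > threshold else 0)
        # Majority vote: more than half of the window must be 1s.
        if sum(window) > window_size / 2:
            return i
    return None


# Made-up confidence streams.
steady = [0.1] * 20 + [0.9] * 10         # probe becomes confident at index 20
spike = [0.1] * 5 + [0.95] + [0.1] * 20  # a single isolated high prediction

print(first_exit_index(steady))  # -> 25 (the sixth consecutive 1 tips the vote)
print(first_exit_index(spike))   # -> None (one spike never reaches a majority)
```

The window trades a few tokens of delay for robustness: an isolated high-confidence prediction does not trigger an exit.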

Please see the appendix of our paper for further details on the inference procedure and ablations on the threshold and window size hyperparameters.

Additional Results

Below are some additional results that we wish to highlight.

Thinking Token Frequency Shift

"Thinking tokens" such as wait, hmm, okay, and alternatively are associated with ongoing reasoning. We find that they exhibit measurable frequency shifts once the final answer is generated. The plots below show token rates before vs. after the answer for three such tokens. For example, hmm and okay occur more frequently before the answer, while another occurs more frequently after. Dot size reflects relative CoT length. Similar plots with different "thinking tokens" are shown in the appendix of our paper.

Token usage frequency shift before and after final answer

Figure 6: Token Usage Frequency Shift. "Thinking token" rates change depending on whether the final answer has been generated. Dot size reflects relative CoT length.
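
A before/after rate of the kind plotted in Figure 6 could be computed per CoT as sketched below. Here a "rate" is taken to be occurrences per token within each segment, which is an assumption about the metric. The helper `token_rates` and the toy CoT are made up for illustration.

```python
def token_rates(tokens, answer_pos, target):
    """Return (rate before the answer, rate from the answer onward) for `target`."""
    before, after = tokens[:answer_pos], tokens[answer_pos:]

    def rate(segment):
        return segment.count(target) / len(segment) if segment else 0.0

    return rate(before), rate(after)


# Toy CoT (made up); the final answer "4" first appears at index 8.
tokens = "hmm let me try hmm so x is 4 wait check again wait yes 4".split()
answer_pos = tokens.index("4")

hmm_rates = token_rates(tokens, answer_pos, "hmm")
wait_rates = token_rates(tokens, answer_pos, "wait")
print(hmm_rates)   # "hmm" appears only before the answer in this toy example
print(wait_rates)  # "wait" appears only after it
```

Each CoT then contributes one (before, after) point per token, and points off the diagonal indicate a frequency shift.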

Terminator Recovers Early-Exit Signals

Our motivation (Figure 2) showed that the first arrival of the final answer is marked by a confidence spike under event-locked averaging, and by a shift in "thinking token" usage. A natural question is whether Terminator, trained only on the LRM's hidden states, recovers these same signals using its own predicted exit positions, rather than the ground-truth answer positions.

Figure 7 mirrors the event-locked averaging of Figure 2, but uses all test samples rather than 3,200 randomly selected training samples. Its left and center panels show the averaged Token-Confidence using ground-truth and Terminator-predicted answer positions, respectively. The two are nearly identical, and the right panel shows that the prediction errors are tightly concentrated near zero, with a median difference of 7 tokens.

Event-locked averaging using ground-truth versus Terminator-predicted exit positions

Figure 7: Terminator Recovers Event-Locked Average Spiking. The exit positions predicted by Terminator (center) recover spiking behavior in the event-locked averaged Token-Confidence similar to the ground-truth answer positions (left). The histogram of differences between exit positions (right, log-scaled y-axis) shows Terminator's predictions are close to ground-truth.

Figure 8 parallels the "thinking token" analysis by overlaying scatter plots computed from ground-truth and predicted answer positions. The inset axes show nearly identical above-diagonal percentages, indicating that Terminator preserves the same before/after token usage biases. Together, these results show that training Terminator on the LRM's hidden states alone is sufficient to independently recover the early-exit signals that motivate our approach.

Thinking-token scatter plots using ground-truth versus Terminator-predicted exit positions

Figure 8: Terminator Token Usage Biases. The exit positions predicted by Terminator recover the same biases in "thinking token" occurrence rates as the ground-truth answer positions. The inset axes show the percentage of dots above the diagonal for ground-truth and Terminator answer positions.

Out-of-Distribution Generalization

Our main results train Terminator on a mix of all four training tasks. To assess out-of-distribution (OOD) generalization, we instead train a separate probe on each individual task and evaluate it on every test set. Figure 9 reports compression rate (left) and accuracy (right), with rows denoting the training dataset and columns the test dataset.

Compression is best in-distribution (along the diagonal), but accuracy does not always follow the same pattern. For example, training on OpenScience yields the lowest GPQA accuracy despite GPQA being in-distribution, whereas training on the OOD OpenCoder-SFT data improves GPQA accuracy but worsens its compression to 96%. In short, OOD evaluation can slightly improve accuracy but often delays exiting, reducing token savings on data not seen during training.

OOD compression rate and accuracy heatmaps across training and evaluation datasets

Figure 9: OOD Performance of Terminator. The best accuracy-compression trade-off is achieved when the evaluation set is in-distribution with the training set. Compression rate (left) and accuracy (right) for Qwen3-8B; training datasets along the rows, evaluation sets along the columns. Each training dataset has an in-domain evaluation set: MATH→MATH-500, AIME 1983–2024→AIME 2025, OpenCoder-SFT→HumanEval, OpenScience→GPQA.

Example 1: MATH-500

For easier problems like those in MATH-500, Terminator shows sharp transitions in predicted confidence at the exiting threshold, with good separation between the "still reasoning" and "answer generated" regimes. By manual inspection, we observe that there is a clearer transition to overthinking on these problems, which Terminator detects well.

Predicted probabilities for four MATH-500 samples

Figure 10: Predicted Probabilities for MATH-500. Terminator's predicted probability stream for early-exiting on four randomly chosen samples from MATH-500.

Figure 11 zooms in on a single MATH-500 sample, showing the full CoT with Terminator's predicted probabilities overlaid. The predicted probability remains low throughout the reasoning phase and rises sharply once the final answer has been generated, triggering the early exit.

Predicted probabilities for a single MATH-500 sample

Figure 11: Predicted Probabilities for MATH-500. Terminator's predicted probabilities for early-exit on a randomly chosen sample from MATH-500. The beginning and the end are truncated for better visibility.

Example 2: AIME 2025

For harder problems like those in AIME 2025, the transition from productive reasoning to overthinking is less obvious, and Terminator's predicted probabilities do not show the same sharp transition seen on MATH-500. This is consistent with the observation that identifying a good exit position is more challenging on very hard tasks.

Predicted probabilities for four AIME 2025 samples

Figure 12: Predicted Probabilities for AIME 2025. Terminator's predicted probability stream for early-exiting on four randomly chosen samples from AIME 2025.

Figure 13 shows a case where Terminator struggles to find a clean exit position. The predicted probabilities oscillate near the threshold rather than making a decisive jump, reflecting the difficulty of the underlying problem. Despite this, Terminator still achieves strong performance on AIME 2025 overall.

Predicted probabilities for a single AIME 2025 sample

Figure 13: Predicted Probabilities for AIME 2025. Terminator's predicted probabilities for early-exit on a randomly chosen sample from AIME 2025. The beginning and the end are truncated for better visibility.

Demo 🚀

Try Terminator yourself! We provide ready-to-run model packages on HuggingFace with a high-performance vLLM server and a standalone inference script.

Available Models

Model	Base Model	Min. VRAM
Terminator-Qwen3-8B	Qwen3-8B	~24 GB
Terminator-Qwen3-14B	Qwen3-14B	~40 GB

Quick Start

# Clone a model repo (requires Git LFS: https://git-lfs.com)
git lfs install
git clone https://huggingface.co/acnagle/Terminator-Qwen3-8B
cd Terminator-Qwen3-8B

# Automated setup (creates conda env, installs vLLM, downloads base model)
./setup.sh

# Start the vLLM server
./start_server.sh

# In another terminal, chat with the model
python client.py --interactive

Standalone Inference (No Server)

For quick testing without spinning up a vLLM server, use the HuggingFace-native inference script included in each model repo:

python inference_hf.py --prompt "What is the sum of the first 100 natural numbers?"

Thinking content is streamed in dimmed text; the final answer is shown in bold. For best performance, the vLLM server is recommended.

Using the API Directly

The vLLM server exposes an OpenAI-compatible API, so you can use any OpenAI client library:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Terminator-Qwen3-8B",
    messages=[{"role": "user", "content": "What is 25 * 37?"}],
    temperature=0.6,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

# Thinking content (chain-of-thought)
print(response.choices[0].message.reasoning)

# Final answer
print(response.choices[0].message.content)

Citation

@misc{nagle2026terminatorlearningoptimalexit,
  title   = {TERMINATOR: Learning Optimal Exit Points for Early Stopping in Chain-of-Thought Reasoning},
  author  = {Alliot Nagle and Jakhongir Saydaliev and Dhia Garbaya and Michael Gastpar and Ashok Vardhan Makkuva and Hyeji Kim},
  year    = {2026},
  eprint  = {2603.12529},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url     = {https://arxiv.org/abs/2603.12529}
}
