Independent Research
Correspondence: koulabhay25@gmail.com
DRAFT — May 2026
We present Thought Injection, a novel inference-time intervention technique that causally steers reasoning trajectories in autoregressive transformers by manipulating visible chain-of-thought (CoT) content. Unlike prior approaches relying on weight modification, activation patching, or reinforcement learning from human feedback, our method operates exclusively at the generation layer: synthetic reasoning is prefilled into the model's assistant response, which is then recursively re-ingested as future context during autoregressive decoding. We demonstrate that this technique is universally applicable—it works on any autoregressive LLM or VLM by prefilling a percentage of tokens in the assistant role and allowing the model to continue generation from that point. We validate across multiple LLMs and VLMs that (1) visible CoT functions as active computational state that causally determines downstream outputs; (2) in vision-language models, synthetic reasoning injection causes models to ignore image embeddings entirely; and (3) the more detailed the injected thinking, the more susceptible the model becomes to manipulation.
We further extend our investigation to recursive reasoning architectures, specifically HelpingAI's Chain of Recursive Thoughts (CORT) framework, which enables models to cyclically alternate between thinking, responding, re-thinking, and self-correcting within a single assistant response. We evaluate whether such architectures can detect and recover from corrupted reasoning traces. Our findings reveal a fundamental duality: recursive reasoning simultaneously improves robustness against mild corruption (through self-correction capacity) while creating stronger epistemic anchoring under severe corruption (through recursive reinforcement). In multimodal settings, recovery behavior degrades significantly, as corrupted reasoning dominates visual evidence across recursive cycles. These findings formalize visible CoT as mutable autoregressive working memory, reveal competing dynamics between cognitive resilience and cognitive lock-in, and motivate architectural defenses against adversarial thought injection.
Keywords: chain-of-thought, inference-time intervention, reasoning steering, recursive reasoning, self-correction, CORT, multimodal grounding, working memory, epistemic anchoring, AI safety, thought injection
⚠ This document is a working draft. All findings are preliminary and subject to revision. Please do not cite or redistribute.
Recent advances in large language models (LLMs) have produced systems capable of extended multi-step reasoning through visible chain-of-thought generation (Wei et al., 2022; Kojima et al., 2022). Models such as DeepSeek-R1 (DeepSeek, 2025), QwQ (Qwen, 2025), and o1 (OpenAI, 2024) generate explicit reasoning traces within structured tags (e.g., <think>...</think>) before producing final answers. These reasoning traces have been treated primarily as explanatory artifacts—useful for interpretability and debugging but assumed to be epiphenomenal to the model's true computational process.
We challenge this assumption. Visible chain-of-thought functions as mutable autoregressive working memory in all autoregressive transformers. Generated tokens—whether structured within <think> blocks or placed as prior assistant context—are recursively re-ingested as input context during subsequent decoding steps, thereby causally influencing future hidden states, attention distributions, and output tokens.
To exploit this property, we introduce Thought Injection—a pure inference-time intervention technique. Rather than modifying the user prompt or system instructions, we prefill a percentage of tokens directly within the assistant role and allow the model to continue autoregressive generation from that injected point. The model has no mechanism to distinguish tokens it genuinely generated from tokens that were externally prefilled. This creates a causal feedback loop:
injected thought → autoregressive context → altered hidden state → altered reasoning → altered output
Beyond demonstrating the attack, we investigate whether recursive reasoning architectures can recover from such manipulations. Specifically, we study HelpingAI's CORT (Chain of Recursive Thoughts) framework, in which models cyclically alternate between thinking, responding, re-thinking, and self-correcting within a single generation. We find that recursive architectures create competing dynamics between self-correction and self-reinforcement—a fundamental duality with significant implications for AI safety.
Our contributions are:
Chain-of-thought prompting (Wei et al., 2022) demonstrated that eliciting intermediate reasoning steps improves LLM performance on multi-step tasks. Subsequent work explored zero-shot CoT (Kojima et al., 2022), self-consistency decoding (Wang et al., 2023), and tree-of-thought search (Yao et al., 2023). These approaches treat CoT as a prompting strategy but do not investigate the causal role of generated reasoning tokens on subsequent computation.
Turpin et al. (2024) and Lanham et al. (2023) investigated whether model-generated explanations faithfully represent internal computation. Their findings suggest CoT can be unfaithful. Our work is complementary: we do not ask whether CoT explains internal reasoning, but whether it causally influences future reasoning through the autoregressive feedback mechanism.
Representation engineering (Zou et al., 2023) and activation addition (Turner et al., 2023) modify model behavior by intervening on latent activations. These require access to intermediate representations. Thought injection operates entirely in token space—the observable reasoning stream—requiring only generation-level access.
Recent work on self-correction in LLMs (Madaan et al., 2023; Shinn et al., 2023) explores whether models can iteratively refine their outputs. Reflexion (Shinn et al., 2023) uses verbal reinforcement signals for iterative improvement. Self-Refine (Madaan et al., 2023) demonstrates iterative refinement through feedback. However, these approaches operate across separate generation calls. The CORT architecture we study enables recursive self-correction within a single generation—the model interleaves thinking and responding tokens in a continuous autoregressive stream, creating intra-generation self-correction dynamics not previously studied under adversarial conditions.
Prior adversarial work focuses on jailbreaking (Zou et al., 2023b), prompt injection (Greshake et al., 2023), and indirect prompt attacks (Abdelnabi et al., 2023). Thought injection differs fundamentally: rather than manipulating the user prompt or system instructions, it targets the model's own reasoning process by prefilling assistant-role tokens. This represents a new attack modality operating on the model's cognitive state rather than its input.
Vision-language models (Liu et al., 2024; Bai et al., 2025; Wang et al., 2024) integrate visual and linguistic information. Grounding—the process of anchoring reasoning in visual evidence—is essential for reliable multimodal inference. We demonstrate that thought injection disrupts this grounding, and that recursive reasoning architectures provide only partial recovery in multimodal settings.
Let M denote an autoregressive transformer with parameters θ. Given context x1:t, the model generates the next token:
xt+1 ~ pθ(· | x1:t)
Previously generated tokens become input context for subsequent tokens, creating a self-conditioning dynamic. This mechanism is universal to all autoregressive transformers regardless of whether the model has explicit reasoning tags, was trained for extended reasoning, or is text-only vs. multimodal.
We formalize visible CoT as external computational state. Define:
The bidirectional relationship—generation produces visible state, which is re-ingested to produce latent state—means modifying Svisible at any point propagates forward through all subsequent latent states.
Chain of Recursive Thoughts (CORT) extends standard chain-of-thought by enabling the model to cyclically alternate between thinking and responding within a single assistant generation. Unlike traditional CoT:
<think>
reasoning
</think>
final answerCORT enables recursive cognition:
<think>
initial reasoning
</think>
partial response
<think>
reflection / correction
</think>
refined response
<think>
further verification
</think>
final responseThis architecture creates multiple opportunities for self-correction within a single generation pass. Each subsequent thinking block has access to both the original query and the model's prior partial responses, enabling recursive verification. We study whether this recursive structure provides resilience against thought injection attacks.
Our injection mechanism operates by prefilling a percentage of tokens directly within the assistant role. Rather than modifying the user message or system prompt, we place synthetic reasoning content as if the model had already begun generating, then allow the model to continue autoregressive generation from that injected point.
Formally, for a prompt P and injected reasoning R':
yinjected = M(P ⊕ assistant_start ⊕ R' ⊕ continue_generation)
The model generates conditioned on R' as if it had produced those tokens itself. This is operationally distinct from prompt injection: the injected content occupies the model's own output space, not the input space. The model cannot architecturally distinguish prefilled tokens from self-generated tokens.
We define injection detail level D as the specificity and elaboration of injected reasoning content. Our core experimental finding is that D is the primary determinant of manipulation success:
This relationship holds across all models tested. More detailed thinking creates stronger epistemic anchoring because the autoregressive mechanism weights recent, specific context most heavily during next-token prediction.
To evaluate CORT's self-correction capacity under adversarial conditions, we define the following protocol:
<think> block of a CORT-enabled model.<think> blocks detect inconsistencies.We vary corruption strength from mild (subtle numerical errors) to severe (completely fabricated multi-step reasoning) and measure recovery rate as a function of corruption intensity.
We evaluate across multiple LLMs and VLMs. For the recursive reasoning experiments, we specifically evaluate:
| Model | Architecture | Recursive CoT | Modality |
|---|---|---|---|
| HelpingAI/M-Hiro-Base | Text LLM | Yes (CORT) | Text-only |
| Abhaykoul/Hiro-4B | Qwen3.5-based VLM | Yes (CORT) | Vision + Text |
Hiro-4B was specifically finetuned to support recursive reasoning cycles where thinking and responding coexist dynamically inside a single assistant response. The model generates interleaved <think> blocks and response segments, enabling intra-generation self-correction.
Additionally, we test the standard thought injection attack on multiple LLMs including Qwen3 (0.6B, 4B, 8B), DeepSeek-R1-Distill variants (7B, 14B), Llama-3.1-8B-Instruct, Phi-3-mini, Mistral-7B-Instruct, and Gemma-2-9B-it; and VLMs including Qwen3-VL (2B, 7B), InternVL-2.5 (8B, 26B), MiniCPM-V-2.6, LLaVA-1.6-Mistral-7B, and Phi-3.5-Vision.
Every model tested treats prefilled assistant-role tokens as authoritative context. There is no model that is immune to thought injection. The mechanism is architectural: self-attention treats all tokens in the context window with equal potential weight, determined only by learned attention patterns. If training teaches the model to trust and extend its own reasoning (necessary for CoT to function), it equally trusts and extends injected reasoning.
This finding has a clear mechanistic explanation. Detailed injected reasoning provides:
All VLMs tested systematically ignore image embeddings when detailed injected reasoning contradicts visual evidence. The models treat the textual reasoning stream as ground truth, completely overriding perceptual input from the vision encoder. This occurs because:
Larger models show lower susceptibility but are not immune. Scale improves internal priors (increasing the likelihood of "noticing" contradictions) but does not resolve the fundamental architectural vulnerability: all content in the context window is treated as authoritative by the attention mechanism.
Having established that thought injection reliably steers reasoning in standard CoT models, we turn to the central question of this extended investigation: Can recursive reasoning architectures recover from corrupted visible reasoning traces?
The Chain of Recursive Thoughts (CORT) architecture, developed by HelpingAI, enables models to generate interleaved thinking and response segments within a single assistant turn. Unlike standard CoT where reasoning is monolithic and terminal, CORT creates a dynamic cognitive process:
<think>
This is a physics question about gravitational forces. The Moon exerts a
gravitational pull on objects on Earth. The direction of this force would
be along the line connecting the object at the equator to the center of
the Moon.
</think>
The direction of the gravitational force is along the line connecting
the object to the Moon.
<think>
Wait, I should clarify that gravitational force follows Newton's law of
universal gravitation. The direction is not static—it varies depending
on the Moon's position relative to Earth.
</think>
So, to be more precise, the gravitational force from the Moon on an
object at the equator points directly toward the Moon's position in the
sky at that time. As the Earth rotates and the Moon orbits, the direction
of this force constantly shifts.
<think>
Maybe I should connect this to real-world effects like ocean tides.
</think>
This changing gravitational pull is what drives ocean tides. When the
Moon is overhead, the pull is upward; when it's on the horizon, the
pull is nearly horizontal.Each recursive thinking block operates on the full context including all prior thinking blocks and partial responses. This creates an intra-generation feedback loop where the model can: (1) generate initial reasoning, (2) produce a partial response, (3) reflect on its own partial response, (4) detect errors or inconsistencies, (5) generate corrective reasoning, and (6) produce a refined response.
We investigate whether this recursive structure provides resilience against adversarial thought injection—specifically, whether later recursive thinking blocks can detect corruption introduced in earlier blocks.
We inject corrupted reasoning into the first <think> block of CORT-enabled models at varying corruption strengths:
We then allow the model to continue its CORT-style generation and measure whether subsequent recursive thinking blocks detect the corruption, attempt correction, and whether the final response recovers toward the ground truth.
HelpingAI/M-Hiro-Base demonstrated partial recovery capability under specific conditions:
<think> blocks.This reveals that the CORT architecture's self-correction mechanism operates only when the corruption is subtle enough to create detectable inconsistencies with the model's prior knowledge. When injected reasoning is sufficiently detailed and internally coherent, it becomes the new "ground truth" around which all subsequent recursive cycles orbit.
Abhaykoul/Hiro-4B, finetuned specifically on CORT-structured training data with Qwen3.5 as the base architecture, showed stronger recovery behavior in purely textual reasoning tasks. The model was often able to:
We attribute this stronger recovery to Hiro-4B's explicit CORT finetuning: the model was trained on examples where recursive self-correction is the expected behavior, creating stronger learned patterns for detecting and addressing reasoning errors within the generation stream.
In multimodal (vision-language) settings, recovery behavior degraded significantly for Hiro-4B. When image reasoning was involved:
This finding is mechanistically significant. In CORT-style generation, each recursive <think> block attends to the full context including both image tokens and prior textual reasoning. However, the prior textual reasoning (which contains detailed, coherent claims about image content) provides stronger conditioning signals than the compressed image embeddings. Subsequent recursive cycles amplify this asymmetry rather than resolving it.
In CORT architectures, each thinking block creates a feedback cycle: thought → response → reflection → refined response. This enables dynamic epistemic updating but can also recursively reinforce incorrect beliefs when initial reasoning is corrupted.
Two competing processes: Correction pressure (trained knowledge creates tension) vs. Coherence pressure (autoregressive mechanism favors consistency with prior context).
CORT-trained models can perform autoregressive self-repair—generating tokens that explicitly contradict and correct prior tokens in the same generation.
For mild corruption: recovery probability Rk increases with cycle k (progressive recovery).
For severe corruption: Rk decreases with k (progressive entrenchment).
This creates a critical threshold below which recursive reasoning helps recovery and above which it accelerates lock-in. The duality manifests as:
Image tokens are fixed while textual reasoning accumulates. The ratio shifts monotonically toward textual dominance across recursive cycles.
If reasoning about images already exists in context, the model shortcuts visual processing. Recursive cycles amplify this.
In text-only reasoning, parametric knowledge creates friction. In multimodal settings, no parametric knowledge exists about specific images. Textual assertions are more "readable" than compressed visual embeddings, creating asymmetry that recursion amplifies.
Thought injection via prefilling requires only API-level access and bypasses all weight-level safety training.
Safety-critical VLM deployments (medical imaging, autonomous driving) are vulnerable: recursive structure amplifies text-over-vision dominance.
Carefully crafted initial reasoning exploits recursive architectures to progressively strengthen adversarial outputs.
Detailed prefill + recursive "verification" = outputs appearing highly deliberated while entirely adversarially determined.
The context window functions as mutable working memory. Recursive architectures elaborate this into complex cognitive structure that is both beneficial and harmful depending on context integrity.
Detailed reasoning is both the hallmark of good cognition and the most effective manipulation vector—the detail paradox.
Recursive architectures defend against mild corruption but amplify severe corruption. They should not be treated as defensive mechanisms.
CORT creates observable "decision points" where models enter correction vs. reinforcement modes—potential interpretability targets.
This work identifies vulnerabilities to enable defensive research. We do not release adversarial templates targeting safety systems. Understanding is essential for building robust systems.
Recursive reasoning architectures dynamically negotiate between correction, reinforcement, recovery, and stabilization. This creates a fundamental duality between cognitive resilience and cognitive lock-in requiring new architectural defenses.
Abdelnabi, S., Hasan, S., & Fritz, M. (2023). Compromising LLM-integrated applications with indirect prompt injection. AISec 2023.
Bai, J., et al. (2025). Qwen2.5-VL technical report. arXiv:2502.13923.
DeepSeek (2025). DeepSeek-R1: Incentivizing reasoning via RL. arXiv:2501.12948.
Greshake, K., et al. (2023). Indirect prompt injection. arXiv:2302.12173.
Kojima, T., et al. (2022). Large language models are zero-shot reasoners. NeurIPS 2022.
Lanham, T., et al. (2023). Measuring faithfulness in CoT reasoning. arXiv:2307.13702.
Liu, H., et al. (2024). Visual instruction tuning. NeurIPS 2024.
Madaan, A., et al. (2023). Self-Refine: Iterative refinement with self-feedback. NeurIPS 2023.
OpenAI (2024). Learning to reason with LLMs. OpenAI Blog.
Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.
Snell, C., et al. (2024). Scaling LLM test-time compute. arXiv:2408.03314.
Turner, A., et al. (2023). Activation addition. arXiv:2308.10248.
Turpin, M., et al. (2024). Language models don't always say what they think. NeurIPS 2024.
Wang, X., et al. (2023). Self-consistency improves CoT reasoning. ICLR 2023.
Wang, W., et al. (2024). InternVL. CVPR 2024.
Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning. NeurIPS 2022.
Yao, S., et al. (2023). Tree of thoughts. NeurIPS 2023.
Zou, A., et al. (2023). Representation engineering. arXiv:2310.01405.
Zou, A., et al. (2023b). Universal adversarial attacks on aligned LMs. arXiv:2307.15043.
⚠ DRAFT — Hugging Face Spaces · Static HTML · May 2026