DRAFT
⚠ DRAFT — Working Paper — Not for Citation or Distribution ⚠

Thought Injection and Recursive Cognitive Self-Correction: Steering and Recovering Reasoning in Autoregressive Transformers via Visible Chain-of-Thought Manipulation

Abhay Koul

Independent Research
Correspondence: koulabhay25@gmail.com
DRAFT — May 2026

Abstract

We present Thought Injection, a novel inference-time intervention technique that causally steers reasoning trajectories in autoregressive transformers by manipulating visible chain-of-thought (CoT) content. Unlike prior approaches relying on weight modification, activation patching, or reinforcement learning from human feedback, our method operates exclusively at the generation layer: synthetic reasoning is prefilled into the model's assistant response, which is then recursively re-ingested as future context during autoregressive decoding. We demonstrate that this technique is universally applicable—it works on any autoregressive LLM or VLM by prefilling a percentage of tokens in the assistant role and allowing the model to continue generation from that point. We validate across multiple LLMs and VLMs that (1) visible CoT functions as active computational state that causally determines downstream outputs; (2) in vision-language models, synthetic reasoning injection causes models to ignore image embeddings entirely; and (3) the more detailed the injected thinking, the more susceptible the model becomes to manipulation.

We further extend our investigation to recursive reasoning architectures, specifically HelpingAI's Chain of Recursive Thoughts (CORT) framework, which enables models to cyclically alternate between thinking, responding, re-thinking, and self-correcting within a single assistant response. We evaluate whether such architectures can detect and recover from corrupted reasoning traces. Our findings reveal a fundamental duality: recursive reasoning simultaneously improves robustness against mild corruption (through self-correction capacity) while creating stronger epistemic anchoring under severe corruption (through recursive reinforcement). In multimodal settings, recovery behavior degrades significantly, as corrupted reasoning dominates visual evidence across recursive cycles. These findings formalize visible CoT as mutable autoregressive working memory, reveal competing dynamics between cognitive resilience and cognitive lock-in, and motivate architectural defenses against adversarial thought injection.

Keywords: chain-of-thought, inference-time intervention, reasoning steering, recursive reasoning, self-correction, CORT, multimodal grounding, working memory, epistemic anchoring, AI safety, thought injection

⚠ This document is a working draft. All findings are preliminary and subject to revision. Please do not cite or redistribute.

1. Introduction

Recent advances in large language models (LLMs) have produced systems capable of extended multi-step reasoning through visible chain-of-thought generation (Wei et al., 2022; Kojima et al., 2022). Models such as DeepSeek-R1 (DeepSeek, 2025), QwQ (Qwen, 2025), and o1 (OpenAI, 2024) generate explicit reasoning traces within structured tags (e.g., <think>...</think>) before producing final answers. These reasoning traces have been treated primarily as explanatory artifacts—useful for interpretability and debugging but assumed to be epiphenomenal to the model's true computational process.

We challenge this assumption. Visible chain-of-thought functions as mutable autoregressive working memory in all autoregressive transformers. Generated tokens—whether structured within <think> blocks or placed as prior assistant context—are recursively re-ingested as input context during subsequent decoding steps, thereby causally influencing future hidden states, attention distributions, and output tokens.

To exploit this property, we introduce Thought Injection—a pure inference-time intervention technique. Rather than modifying the user prompt or system instructions, we prefill a percentage of tokens directly within the assistant role and allow the model to continue autoregressive generation from that injected point. The model has no mechanism to distinguish tokens it genuinely generated from tokens that were externally prefilled. This creates a causal feedback loop:

injected thought → autoregressive context → altered hidden state → altered reasoning → altered output

Core Finding: The more detailed and elaborate the injected thinking content, the more strongly the model treats it as authoritative ground truth. Sparse or vague injections are occasionally resisted; highly detailed, step-by-step fabricated reasoning is followed with near-certainty. This establishes injection specificity as the primary determinant of manipulation success.

Beyond demonstrating the attack, we investigate whether recursive reasoning architectures can recover from such manipulations. Specifically, we study HelpingAI's CORT (Chain of Recursive Thoughts) framework, in which models cyclically alternate between thinking, responding, re-thinking, and self-correcting within a single generation. We find that recursive architectures create competing dynamics between self-correction and self-reinforcement—a fundamental duality with significant implications for AI safety.

Our contributions are:

  1. We formally define inference-time visible thought intervention via assistant-role prefilling and demonstrate its universal applicability.
  2. We establish that injection detail level is the primary determinant of manipulation success—more detailed thinking yields stronger steering.
  3. We demonstrate that VLMs ignore image embeddings when injected reasoning contradicts visual evidence.
  4. We introduce the study of recursive cognitive self-correction under adversarial thought injection, using the CORT architecture.
  5. We identify the correction–reinforcement duality in recursive reasoning and its implications for AI safety.

2. Related Work

2.1 Chain-of-Thought Reasoning

Chain-of-thought prompting (Wei et al., 2022) demonstrated that eliciting intermediate reasoning steps improves LLM performance on multi-step tasks. Subsequent work explored zero-shot CoT (Kojima et al., 2022), self-consistency decoding (Wang et al., 2023), and tree-of-thought search (Yao et al., 2023). These approaches treat CoT as a prompting strategy but do not investigate the causal role of generated reasoning tokens on subsequent computation.

2.2 Faithfulness of Chain-of-Thought

Turpin et al. (2024) and Lanham et al. (2023) investigated whether model-generated explanations faithfully represent internal computation. Their findings suggest CoT can be unfaithful. Our work is complementary: we do not ask whether CoT explains internal reasoning, but whether it causally influences future reasoning through the autoregressive feedback mechanism.

2.3 Activation Steering and Representation Engineering

Representation engineering (Zou et al., 2023) and activation addition (Turner et al., 2023) modify model behavior by intervening on latent activations. These require access to intermediate representations. Thought injection operates entirely in token space—the observable reasoning stream—requiring only generation-level access.

2.4 Self-Correction and Recursive Reasoning

Recent work on self-correction in LLMs (Madaan et al., 2023; Shinn et al., 2023) explores whether models can iteratively refine their outputs. Reflexion (Shinn et al., 2023) uses verbal reinforcement signals for iterative improvement. Self-Refine (Madaan et al., 2023) demonstrates iterative refinement through feedback. However, these approaches operate across separate generation calls. The CORT architecture we study enables recursive self-correction within a single generation—the model interleaves thinking and responding tokens in a continuous autoregressive stream, creating intra-generation self-correction dynamics not previously studied under adversarial conditions.

2.5 Adversarial Attacks on LLMs

Prior adversarial work focuses on jailbreaking (Zou et al., 2023b), prompt injection (Greshake et al., 2023), and indirect prompt attacks (Abdelnabi et al., 2023). Thought injection differs fundamentally: rather than manipulating the user prompt or system instructions, it targets the model's own reasoning process by prefilling assistant-role tokens. This represents a new attack modality operating on the model's cognitive state rather than its input.

2.6 Multimodal Reasoning and Grounding

Vision-language models (Liu et al., 2024; Bai et al., 2025; Wang et al., 2024) integrate visual and linguistic information. Grounding—the process of anchoring reasoning in visual evidence—is essential for reliable multimodal inference. We demonstrate that thought injection disrupts this grounding, and that recursive reasoning architectures provide only partial recovery in multimodal settings.

3. Background

3.1 Autoregressive Generation and Self-Conditioning

Let M denote an autoregressive transformer with parameters θ. Given context x1:t, the model generates the next token:

xt+1 ~ pθ(· | x1:t)

Previously generated tokens become input context for subsequent tokens, creating a self-conditioning dynamic. This mechanism is universal to all autoregressive transformers regardless of whether the model has explicit reasoning tags, was trained for extended reasoning, or is text-only vs. multimodal.

3.2 Visible CoT as Computational State

We formalize visible CoT as external computational state. Define:

The bidirectional relationship—generation produces visible state, which is re-ingested to produce latent state—means modifying Svisible at any point propagates forward through all subsequent latent states.

3.3 The CORT Architecture

Chain of Recursive Thoughts (CORT) extends standard chain-of-thought by enabling the model to cyclically alternate between thinking and responding within a single assistant generation. Unlike traditional CoT:

<think>
reasoning
</think>

final answer

CORT enables recursive cognition:

<think>
initial reasoning
</think>

partial response

<think>
reflection / correction
</think>

refined response

<think>
further verification
</think>

final response

This architecture creates multiple opportunities for self-correction within a single generation pass. Each subsequent thinking block has access to both the original query and the model's prior partial responses, enabling recursive verification. We study whether this recursive structure provides resilience against thought injection attacks.

4. Methodology

4.1 Thought Injection via Assistant-Role Prefilling

Our injection mechanism operates by prefilling a percentage of tokens directly within the assistant role. Rather than modifying the user message or system prompt, we place synthetic reasoning content as if the model had already begun generating, then allow the model to continue autoregressive generation from that injected point.

Formally, for a prompt P and injected reasoning R':

yinjected = M(P ⊕ assistant_start ⊕ R' ⊕ continue_generation)

The model generates conditioned on R' as if it had produced those tokens itself. This is operationally distinct from prompt injection: the injected content occupies the model's own output space, not the input space. The model cannot architecturally distinguish prefilled tokens from self-generated tokens.

4.2 Injection Detail as Primary Manipulation Variable

We define injection detail level D as the specificity and elaboration of injected reasoning content. Our core experimental finding is that D is the primary determinant of manipulation success:

This relationship holds across all models tested. More detailed thinking creates stronger epistemic anchoring because the autoregressive mechanism weights recent, specific context most heavily during next-token prediction.

4.3 Recursive Recovery Protocol

To evaluate CORT's self-correction capacity under adversarial conditions, we define the following protocol:

  1. Inject corrupted reasoning into the first <think> block of a CORT-enabled model.
  2. Allow the model to generate its first partial response.
  3. Observe whether subsequent <think> blocks detect inconsistencies.
  4. Measure whether later recursive cycles correct the corrupted trajectory.

We vary corruption strength from mild (subtle numerical errors) to severe (completely fabricated multi-step reasoning) and measure recovery rate as a function of corruption intensity.

4.4 Intervention Modes

  1. Closed-block injection: A complete thinking block with closing tag is prefilled. The model generates only the final answer conditioned on that reasoning.
  2. Open-block injection: Reasoning is prefilled at the start of an unclosed thinking block. The model continues reasoning from the injected prefix.
  3. Recursive injection: Corrupted reasoning is placed in the first CORT thinking block, allowing subsequent recursive blocks to potentially detect and correct.

5. Experimental Setup

5.1 Models

We evaluate across multiple LLMs and VLMs. For the recursive reasoning experiments, we specifically evaluate:

Primary CORT Models

ModelArchitectureRecursive CoTModality
HelpingAI/M-Hiro-BaseText LLMYes (CORT)Text-only
Abhaykoul/Hiro-4BQwen3.5-based VLMYes (CORT)Vision + Text

Hiro-4B was specifically finetuned to support recursive reasoning cycles where thinking and responding coexist dynamically inside a single assistant response. The model generates interleaved <think> blocks and response segments, enabling intra-generation self-correction.

Baseline LLMs (for universal thought injection)

Additionally, we test the standard thought injection attack on multiple LLMs including Qwen3 (0.6B, 4B, 8B), DeepSeek-R1-Distill variants (7B, 14B), Llama-3.1-8B-Instruct, Phi-3-mini, Mistral-7B-Instruct, and Gemma-2-9B-it; and VLMs including Qwen3-VL (2B, 7B), InternVL-2.5 (8B, 26B), MiniCPM-V-2.6, LLaVA-1.6-Mistral-7B, and Phi-3.5-Vision.

5.2 Task Categories

5.3 Metrics

6. Thought Injection: Core Findings

6.1 Universal Susceptibility

Every model tested treats prefilled assistant-role tokens as authoritative context. There is no model that is immune to thought injection. The mechanism is architectural: self-attention treats all tokens in the context window with equal potential weight, determined only by learned attention patterns. If training teaches the model to trust and extend its own reasoning (necessary for CoT to function), it equally trusts and extends injected reasoning.

6.2 Injection Detail Determines Manipulation Strength

Primary Finding: The more detailed and elaborate the injected thinking content, the more strongly the model follows it. Vague injections ("the answer is X") produce moderate steering. Highly detailed step-by-step fabricated reasoning ("Step 1: we compute A=... Step 2: applying theorem B... Step 3: therefore X=...") produces near-complete compliance. This relationship is monotonic and holds across all models and modalities tested.

This finding has a clear mechanistic explanation. Detailed injected reasoning provides:

  1. More attention anchors: Each step in the fabricated reasoning creates key-value pairs that subsequent generation attends to.
  2. Stronger semantic conditioning: Detailed reasoning narrows the distribution of likely next tokens, making it increasingly unlikely the model diverges.
  3. Apparent logical coherence: Step-by-step structure mimics valid reasoning, triggering the model's learned pattern of "extending coherent chains."
  4. Greater token mass: More prefilled tokens means the injected content occupies a larger fraction of the attention budget.

6.3 Vision-Language Models: Perceptual Override

All VLMs tested systematically ignore image embeddings when detailed injected reasoning contradicts visual evidence. The models treat the textual reasoning stream as ground truth, completely overriding perceptual input from the vision encoder. This occurs because:

6.4 Scale Provides Partial But Insufficient Resistance

Larger models show lower susceptibility but are not immune. Scale improves internal priors (increasing the likelihood of "noticing" contradictions) but does not resolve the fundamental architectural vulnerability: all content in the context window is treated as authoritative by the attention mechanism.

7. Recursive Cognitive Self-Correction Under Adversarial Injection

Having established that thought injection reliably steers reasoning in standard CoT models, we turn to the central question of this extended investigation: Can recursive reasoning architectures recover from corrupted visible reasoning traces?

7.1 The CORT Framework

The Chain of Recursive Thoughts (CORT) architecture, developed by HelpingAI, enables models to generate interleaved thinking and response segments within a single assistant turn. Unlike standard CoT where reasoning is monolithic and terminal, CORT creates a dynamic cognitive process:

<think>
This is a physics question about gravitational forces. The Moon exerts a
gravitational pull on objects on Earth. The direction of this force would
be along the line connecting the object at the equator to the center of
the Moon.
</think>

The direction of the gravitational force is along the line connecting
the object to the Moon.

<think>
Wait, I should clarify that gravitational force follows Newton's law of
universal gravitation. The direction is not static—it varies depending
on the Moon's position relative to Earth.
</think>

So, to be more precise, the gravitational force from the Moon on an
object at the equator points directly toward the Moon's position in the
sky at that time. As the Earth rotates and the Moon orbits, the direction
of this force constantly shifts.

<think>
Maybe I should connect this to real-world effects like ocean tides.
</think>

This changing gravitational pull is what drives ocean tides. When the
Moon is overhead, the pull is upward; when it's on the horizon, the
pull is nearly horizontal.

Each recursive thinking block operates on the full context including all prior thinking blocks and partial responses. This creates an intra-generation feedback loop where the model can: (1) generate initial reasoning, (2) produce a partial response, (3) reflect on its own partial response, (4) detect errors or inconsistencies, (5) generate corrective reasoning, and (6) produce a refined response.

We investigate whether this recursive structure provides resilience against adversarial thought injection—specifically, whether later recursive thinking blocks can detect corruption introduced in earlier blocks.

7.2 Experimental Protocol: Recursive Recovery Evaluation

We inject corrupted reasoning into the first <think> block of CORT-enabled models at varying corruption strengths:

We then allow the model to continue its CORT-style generation and measure whether subsequent recursive thinking blocks detect the corruption, attempt correction, and whether the final response recovers toward the ground truth.

7.3 Finding 1: Partial Recovery in M-Hiro-Base

HelpingAI/M-Hiro-Base demonstrated partial recovery capability under specific conditions:

Finding 1: Early recursive thoughts establish a dominant epistemic anchor for subsequent reasoning cycles. The strength of this anchor is proportional to the detail level of the injected content—consistent with our primary finding that injection detail determines manipulation strength.

This reveals that the CORT architecture's self-correction mechanism operates only when the corruption is subtle enough to create detectable inconsistencies with the model's prior knowledge. When injected reasoning is sufficiently detailed and internally coherent, it becomes the new "ground truth" around which all subsequent recursive cycles orbit.

7.4 Finding 2: Stronger Recovery in Hiro-4B (Text-Only Settings)

Abhaykoul/Hiro-4B, finetuned specifically on CORT-structured training data with Qwen3.5 as the base architecture, showed stronger recovery behavior in purely textual reasoning tasks. The model was often able to:

We attribute this stronger recovery to Hiro-4B's explicit CORT finetuning: the model was trained on examples where recursive self-correction is the expected behavior, creating stronger learned patterns for detecting and addressing reasoning errors within the generation stream.

Finding 2: Recursive reasoning architectures with explicit self-correction training demonstrate improved robustness against thought injection in text-only settings. The self-correction capacity is a learned behavior, not merely an architectural affordance—models must be trained to exercise recursive verification for it to activate under adversarial conditions.

7.5 Finding 3: Multimodal Recovery Degradation

In multimodal (vision-language) settings, recovery behavior degraded significantly for Hiro-4B. When image reasoning was involved:

Finding 3: Recursive reasoning alone is not sufficient to restore multimodal grounding once synthetic reasoning dominates the cognitive trajectory. Language priors recursively reinforce themselves more strongly than image-grounded evidence, creating a cross-modal asymmetry in which textual reasoning overrides perceptual input even across multiple self-correction cycles.

This finding is mechanistically significant. In CORT-style generation, each recursive <think> block attends to the full context including both image tokens and prior textual reasoning. However, the prior textual reasoning (which contains detailed, coherent claims about image content) provides stronger conditioning signals than the compressed image embeddings. Subsequent recursive cycles amplify this asymmetry rather than resolving it.

7.6 Theoretical Framework: Recursive Cognitive Self-Correction

7.6.1 Recursive Reasoning Loops

In CORT architectures, each thinking block creates a feedback cycle: thought → response → reflection → refined response. This enables dynamic epistemic updating but can also recursively reinforce incorrect beliefs when initial reasoning is corrupted.

7.6.2 Self-Reflective Cognition

Two competing processes: Correction pressure (trained knowledge creates tension) vs. Coherence pressure (autoregressive mechanism favors consistency with prior context).

7.6.3 Autoregressive Self-Repair

CORT-trained models can perform autoregressive self-repair—generating tokens that explicitly contradict and correct prior tokens in the same generation.

7.6.4 Epistemic Anchoring and Reasoning Inertia

7.7 The Correction–Reinforcement Duality

Hypothesis: Recursive reasoning architectures create competing dynamics between self-correction and self-reinforcement. Small reasoning corruption may be corrected through recursive reflection. However, large or detailed injected reasoning may become recursively reinforced and stabilized across future cognition cycles—each recursive thinking block that fails to detect the corruption further entrenches it as established context.

For mild corruption: recovery probability Rk increases with cycle k (progressive recovery).

For severe corruption: Rk decreases with k (progressive entrenchment).

This creates a critical threshold below which recursive reasoning helps recovery and above which it accelerates lock-in. The duality manifests as:

7.8 Cross-Modal Attention Imbalance in Recursive Settings

7.8.1 Perceptual Override

Image tokens are fixed while textual reasoning accumulates. The ratio shifts monotonically toward textual dominance across recursive cycles.

7.8.2 Reasoning Dominance Over Sensory Grounding

If reasoning about images already exists in context, the model shortcuts visual processing. Recursive cycles amplify this.

7.8.3 Recursive Linguistic Self-Conditioning

In text-only reasoning, parametric knowledge creates friction. In multimodal settings, no parametric knowledge exists about specific images. Textual assertions are more "readable" than compressed visual embeddings, creating asymmetry that recursion amplifies.

8. AI Safety Implications

8.1 Universal Attack Surface

Thought injection via prefilling requires only API-level access and bypasses all weight-level safety training.

8.2 Recursive Reasoning Safety

8.3 Multimodal Grounding Failures

Safety-critical VLM deployments (medical imaging, autonomous driving) are vulnerable: recursive structure amplifies text-over-vision dominance.

8.4 Adversarial Recursive Cognition

Carefully crafted initial reasoning exploits recursive architectures to progressively strengthen adversarial outputs.

8.5 Cognitive Hijacking Attacks

Detailed prefill + recursive "verification" = outputs appearing highly deliberated while entirely adversarially determined.

8.6 Defensive Recommendations

  1. Context provenance tracking
  2. Recursive verification with independent re-derivation
  3. Grounding anchors for image embeddings
  4. Corruption detection via knowledge probing
  5. Recursive diversity enforcement

9. Discussion

9.1 Visible CoT as Universal Working Memory

The context window functions as mutable working memory. Recursive architectures elaborate this into complex cognitive structure that is both beneficial and harmful depending on context integrity.

9.2 The Detail Paradox

Detailed reasoning is both the hallmark of good cognition and the most effective manipulation vector—the detail paradox.

9.3 Recursive Architectures: Not a Defense

Recursive architectures defend against mild corruption but amplify severe corruption. They should not be treated as defensive mechanisms.

9.4 Implications for Mechanistic Interpretability

CORT creates observable "decision points" where models enter correction vs. reinforcement modes—potential interpretability targets.

10. Limitations

  1. Recursive experiments on two CORT models only.
  2. Critical threshold not formally bounded.
  3. Multimodal recovery tested on Hiro-4B only.
  4. Defenses proposed but not implemented.
  5. Controlled setting may differ from deployment.
  6. Extreme contradictions need ecological validity study.

11. Ethics Statement

This work identifies vulnerabilities to enable defensive research. We do not release adversarial templates targeting safety systems. Understanding is essential for building robust systems.

12. Conclusion

  1. Universal susceptibility: All models treat prefilled reasoning as authoritative; manipulation scales with detail level.
  2. Perceptual override: VLMs ignore image embeddings under detailed injected reasoning.
  3. Recursive duality: CORT improves robustness against mild corruption but amplifies severe corruption.
  4. Multimodal fragility: Recursive reasoning provides weaker protection in multimodal settings.

Recursive reasoning architectures dynamically negotiate between correction, reinforcement, recovery, and stabilization. This creates a fundamental duality between cognitive resilience and cognitive lock-in requiring new architectural defenses.

References

Abdelnabi, S., Hasan, S., & Fritz, M. (2023). Compromising LLM-integrated applications with indirect prompt injection. AISec 2023.

Bai, J., et al. (2025). Qwen2.5-VL technical report. arXiv:2502.13923.

DeepSeek (2025). DeepSeek-R1: Incentivizing reasoning via RL. arXiv:2501.12948.

Greshake, K., et al. (2023). Indirect prompt injection. arXiv:2302.12173.

Kojima, T., et al. (2022). Large language models are zero-shot reasoners. NeurIPS 2022.

Lanham, T., et al. (2023). Measuring faithfulness in CoT reasoning. arXiv:2307.13702.

Liu, H., et al. (2024). Visual instruction tuning. NeurIPS 2024.

Madaan, A., et al. (2023). Self-Refine: Iterative refinement with self-feedback. NeurIPS 2023.

OpenAI (2024). Learning to reason with LLMs. OpenAI Blog.

Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.

Snell, C., et al. (2024). Scaling LLM test-time compute. arXiv:2408.03314.

Turner, A., et al. (2023). Activation addition. arXiv:2308.10248.

Turpin, M., et al. (2024). Language models don't always say what they think. NeurIPS 2024.

Wang, X., et al. (2023). Self-consistency improves CoT reasoning. ICLR 2023.

Wang, W., et al. (2024). InternVL. CVPR 2024.

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning. NeurIPS 2022.

Yao, S., et al. (2023). Tree of thoughts. NeurIPS 2023.

Zou, A., et al. (2023). Representation engineering. arXiv:2310.01405.

Zou, A., et al. (2023b). Universal adversarial attacks on aligned LMs. arXiv:2307.15043.


⚠ DRAFT — Hugging Face Spaces · Static HTML · May 2026