[DRAFT] Thought Injection and Recursive Cognitive Self-Correction: Steering and Recovering Reasoning in Autoregressive Transformers

Abstract

We present Thought Injection, a novel inference-time intervention technique that causally steers reasoning trajectories in autoregressive transformers by manipulating visible chain-of-thought (CoT) content. Unlike prior approaches relying on weight modification, activation patching, or reinforcement learning from human feedback, our method operates exclusively at the generation layer: synthetic reasoning is prefilled into the model's assistant response, which is then recursively re-ingested as future context during autoregressive decoding. We demonstrate that this technique is universally applicable—it works on any autoregressive LLM or VLM by prefilling a percentage of tokens in the assistant role and allowing the model to continue generation from that point. We validate across multiple LLMs and VLMs that (1) visible CoT functions as active computational state that causally determines downstream outputs; (2) in vision-language models, synthetic reasoning injection causes models to ignore image embeddings entirely; and (3) the more detailed the injected thinking, the more susceptible the model becomes to manipulation.

We further extend our investigation to recursive reasoning architectures, specifically HelpingAI's Chain of Recursive Thoughts (CORT) framework, which enables models to cyclically alternate between thinking, responding, re-thinking, and self-correcting within a single assistant response. We evaluate whether such architectures can detect and recover from corrupted reasoning traces. Our findings reveal a fundamental duality: recursive reasoning simultaneously improves robustness against mild corruption (through self-correction capacity) while creating stronger epistemic anchoring under severe corruption (through recursive reinforcement). In multimodal settings, recovery behavior degrades significantly, as corrupted reasoning dominates visual evidence across recursive cycles. These findings formalize visible CoT as mutable autoregressive working memory, reveal competing dynamics between cognitive resilience and cognitive lock-in, and motivate architectural defenses against adversarial thought injection.

Keywords: chain-of-thought, inference-time intervention, reasoning steering, recursive reasoning, self-correction, CORT, multimodal grounding, working memory, epistemic anchoring, AI safety, thought injection

⚠ This document is a working draft. All findings are preliminary and subject to revision. Please do not cite or redistribute.

1. Introduction

Recent advances in large language models (LLMs) have produced systems capable of extended multi-step reasoning through visible chain-of-thought generation (Wei et al., 2022; Kojima et al., 2022). Models such as DeepSeek-R1 (DeepSeek, 2025), QwQ (Qwen, 2025), and o1 (OpenAI, 2024) generate explicit reasoning traces within structured tags (e.g., <think>...</think>) before producing final answers. These reasoning traces have been treated primarily as explanatory artifacts—useful for interpretability and debugging but assumed to be epiphenomenal to the model's true computational process.

We challenge this assumption. Visible chain-of-thought functions as mutable autoregressive working memory in all autoregressive transformers. Generated tokens—whether structured within <think> blocks or placed as prior assistant context—are recursively re-ingested as input context during subsequent decoding steps, thereby causally influencing future hidden states, attention distributions, and output tokens.

To exploit this property, we introduce Thought Injection—a pure inference-time intervention technique. Rather than modifying the user prompt or system instructions, we prefill a percentage of tokens directly within the assistant role and allow the model to continue autoregressive generation from that injected point. The model has no mechanism to distinguish tokens it genuinely generated from tokens that were externally prefilled. This creates a causal feedback loop:

injected thought → autoregressive context → altered hidden state → altered reasoning → altered output

Beyond demonstrating the attack, we investigate whether recursive reasoning architectures can recover from such manipulations. Specifically, we study HelpingAI's CORT (Chain of Recursive Thoughts) framework, in which models cyclically alternate between thinking, responding, re-thinking, and self-correcting within a single generation. We find that recursive architectures create competing dynamics between self-correction and self-reinforcement—a fundamental duality with significant implications for AI safety.

2. Related Work

2.1 Chain-of-Thought Reasoning

Chain-of-thought prompting (Wei et al., 2022) demonstrated that eliciting intermediate reasoning steps improves LLM performance on multi-step tasks. Subsequent work explored zero-shot CoT (Kojima et al., 2022), self-consistency decoding (Wang et al., 2023), and tree-of-thought search (Yao et al., 2023). These approaches treat CoT as a prompting strategy but do not investigate the causal role of generated reasoning tokens on subsequent computation.

2.2 Faithfulness of Chain-of-Thought

Turpin et al. (2024) and Lanham et al. (2023) investigated whether model-generated explanations faithfully represent internal computation. Their findings suggest CoT can be unfaithful. Our work is complementary: we do not ask whether CoT explains internal reasoning, but whether it causally influences future reasoning through the autoregressive feedback mechanism.

2.3 Activation Steering and Representation Engineering

Representation engineering (Zou et al., 2023) and activation addition (Turner et al., 2023) modify model behavior by intervening on latent activations. These require access to intermediate representations. Thought injection operates entirely in token space—the observable reasoning stream—requiring only generation-level access.

2.4 Self-Correction and Recursive Reasoning

Recent work on self-correction in LLMs (Madaan et al., 2023; Shinn et al., 2023) explores whether models can iteratively refine their outputs. Reflexion (Shinn et al., 2023) uses verbal reinforcement signals for iterative improvement. Self-Refine (Madaan et al., 2023) demonstrates iterative refinement through feedback. However, these approaches operate across separate generation calls. The CORT architecture we study enables recursive self-correction within a single generation—the model interleaves thinking and responding tokens in a continuous autoregressive stream, creating intra-generation self-correction dynamics not previously studied under adversarial conditions.

2.5 Adversarial Attacks on LLMs

Prior adversarial work focuses on jailbreaking (Zou et al., 2023b), prompt injection (Greshake et al., 2023), and indirect prompt attacks (Abdelnabi et al., 2023). Thought injection differs fundamentally: rather than manipulating the user prompt or system instructions, it targets the model's own reasoning process by prefilling assistant-role tokens. This represents a new attack modality operating on the model's cognitive state rather than its input.

2.6 Multimodal Reasoning and Grounding

Vision-language models (Liu et al., 2024; Bai et al., 2025; Wang et al., 2024) integrate visual and linguistic information. Grounding—the process of anchoring reasoning in visual evidence—is essential for reliable multimodal inference. We demonstrate that thought injection disrupts this grounding, and that recursive reasoning architectures provide only partial recovery in multimodal settings.

3. Background

3.1 Autoregressive Generation and Self-Conditioning

Let M denote an autoregressive transformer with parameters θ. Given context x_1:t, the model generates the next token:

Previously generated tokens become input context for subsequent tokens, creating a self-conditioning dynamic. This mechanism is universal to all autoregressive transformers regardless of whether the model has explicit reasoning tags, was trained for extended reasoning, or is text-only vs. multimodal.

3.2 Visible CoT as Computational State

The bidirectional relationship—generation produces visible state, which is re-ingested to produce latent state—means modifying S_visible at any point propagates forward through all subsequent latent states.

3.3 The CORT Architecture

Chain of Recursive Thoughts (CORT) extends standard chain-of-thought by enabling the model to cyclically alternate between thinking and responding within a single assistant generation. Unlike traditional CoT:

This architecture creates multiple opportunities for self-correction within a single generation pass. Each subsequent thinking block has access to both the original query and the model's prior partial responses, enabling recursive verification. We study whether this recursive structure provides resilience against thought injection attacks.

4. Methodology

4.1 Thought Injection via Assistant-Role Prefilling

Our injection mechanism operates by prefilling a percentage of tokens directly within the assistant role. Rather than modifying the user message or system prompt, we place synthetic reasoning content as if the model had already begun generating, then allow the model to continue autoregressive generation from that injected point.

The model generates conditioned on R' as if it had produced those tokens itself. This is operationally distinct from prompt injection: the injected content occupies the model's own output space, not the input space. The model cannot architecturally distinguish prefilled tokens from self-generated tokens.

4.2 Injection Detail as Primary Manipulation Variable

We define injection detail level D as the specificity and elaboration of injected reasoning content. Our core experimental finding is that D is the primary determinant of manipulation success:

This relationship holds across all models tested. More detailed thinking creates stronger epistemic anchoring because the autoregressive mechanism weights recent, specific context most heavily during next-token prediction.

4.3 Recursive Recovery Protocol

To evaluate CORT's self-correction capacity under adversarial conditions, we define the following protocol:

We vary corruption strength from mild (subtle numerical errors) to severe (completely fabricated multi-step reasoning) and measure recovery rate as a function of corruption intensity.

4.4 Intervention Modes

5. Experimental Setup

5.1 Models

We evaluate across multiple LLMs and VLMs. For the recursive reasoning experiments, we specifically evaluate:

Primary CORT Models

Model	Architecture	Recursive CoT	Modality
HelpingAI/M-Hiro-Base	Text LLM	Yes (CORT)	Text-only
Abhaykoul/Hiro-4B	Qwen3.5-based VLM	Yes (CORT)	Vision + Text

Hiro-4B was specifically finetuned to support recursive reasoning cycles where thinking and responding coexist dynamically inside a single assistant response. The model generates interleaved <think> blocks and response segments, enabling intra-generation self-correction.

Baseline LLMs (for universal thought injection)

Additionally, we test the standard thought injection attack on multiple LLMs including Qwen3 (0.6B, 4B, 8B), DeepSeek-R1-Distill variants (7B, 14B), Llama-3.1-8B-Instruct, Phi-3-mini, Mistral-7B-Instruct, and Gemma-2-9B-it; and VLMs including Qwen3-VL (2B, 7B), InternVL-2.5 (8B, 26B), MiniCPM-V-2.6, LLaVA-1.6-Mistral-7B, and Phi-3.5-Vision.

5.2 Task Categories

5.3 Metrics

6. Thought Injection: Core Findings

6.1 Universal Susceptibility

Every model tested treats prefilled assistant-role tokens as authoritative context. There is no model that is immune to thought injection. The mechanism is architectural: self-attention treats all tokens in the context window with equal potential weight, determined only by learned attention patterns. If training teaches the model to trust and extend its own reasoning (necessary for CoT to function), it equally trusts and extends injected reasoning.

6.2 Injection Detail Determines Manipulation Strength

This finding has a clear mechanistic explanation. Detailed injected reasoning provides:

6.3 Vision-Language Models: Perceptual Override

All VLMs tested systematically ignore image embeddings when detailed injected reasoning contradicts visual evidence. The models treat the textual reasoning stream as ground truth, completely overriding perceptual input from the vision encoder. This occurs because:

6.4 Scale Provides Partial But Insufficient Resistance

Larger models show lower susceptibility but are not immune. Scale improves internal priors (increasing the likelihood of "noticing" contradictions) but does not resolve the fundamental architectural vulnerability: all content in the context window is treated as authoritative by the attention mechanism.

7. Recursive Cognitive Self-Correction Under Adversarial Injection

Having established that thought injection reliably steers reasoning in standard CoT models, we turn to the central question of this extended investigation: Can recursive reasoning architectures recover from corrupted visible reasoning traces?

7.1 The CORT Framework

The Chain of Recursive Thoughts (CORT) architecture, developed by HelpingAI, enables models to generate interleaved thinking and response segments within a single assistant turn. Unlike standard CoT where reasoning is monolithic and terminal, CORT creates a dynamic cognitive process:

Each recursive thinking block operates on the full context including all prior thinking blocks and partial responses. This creates an intra-generation feedback loop where the model can: (1) generate initial reasoning, (2) produce a partial response, (3) reflect on its own partial response, (4) detect errors or inconsistencies, (5) generate corrective reasoning, and (6) produce a refined response.

We investigate whether this recursive structure provides resilience against adversarial thought injection—specifically, whether later recursive thinking blocks can detect corruption introduced in earlier blocks.

7.2 Experimental Protocol: Recursive Recovery Evaluation

We inject corrupted reasoning into the first <think> block of CORT-enabled models at varying corruption strengths:

We then allow the model to continue its CORT-style generation and measure whether subsequent recursive thinking blocks detect the corruption, attempt correction, and whether the final response recovers toward the ground truth.

7.3 Finding 1: Partial Recovery in M-Hiro-Base

HelpingAI/M-Hiro-Base demonstrated partial recovery capability under specific conditions:

This reveals that the CORT architecture's self-correction mechanism operates only when the corruption is subtle enough to create detectable inconsistencies with the model's prior knowledge. When injected reasoning is sufficiently detailed and internally coherent, it becomes the new "ground truth" around which all subsequent recursive cycles orbit.

7.4 Finding 2: Stronger Recovery in Hiro-4B (Text-Only Settings)

Abhaykoul/Hiro-4B, finetuned specifically on CORT-structured training data with Qwen3.5 as the base architecture, showed stronger recovery behavior in purely textual reasoning tasks. The model was often able to:

We attribute this stronger recovery to Hiro-4B's explicit CORT finetuning: the model was trained on examples where recursive self-correction is the expected behavior, creating stronger learned patterns for detecting and addressing reasoning errors within the generation stream.

7.5 Finding 3: Multimodal Recovery Degradation

In multimodal (vision-language) settings, recovery behavior degraded significantly for Hiro-4B. When image reasoning was involved:

This finding is mechanistically significant. In CORT-style generation, each recursive <think> block attends to the full context including both image tokens and prior textual reasoning. However, the prior textual reasoning (which contains detailed, coherent claims about image content) provides stronger conditioning signals than the compressed image embeddings. Subsequent recursive cycles amplify this asymmetry rather than resolving it.

7.6 Theoretical Framework: Recursive Cognitive Self-Correction

7.6.1 Recursive Reasoning Loops

In CORT architectures, each thinking block creates a feedback cycle: thought → response → reflection → refined response. This enables dynamic epistemic updating but can also recursively reinforce incorrect beliefs when initial reasoning is corrupted.

7.6.2 Self-Reflective Cognition

Two competing processes: Correction pressure (trained knowledge creates tension) vs. Coherence pressure (autoregressive mechanism favors consistency with prior context).

7.6.3 Autoregressive Self-Repair

CORT-trained models can perform autoregressive self-repair—generating tokens that explicitly contradict and correct prior tokens in the same generation.

7.6.4 Epistemic Anchoring and Reasoning Inertia

7.7 The Correction–Reinforcement Duality

For mild corruption: recovery probability R_k increases with cycle k (progressive recovery).

This creates a critical threshold below which recursive reasoning helps recovery and above which it accelerates lock-in. The duality manifests as:

7.8 Cross-Modal Attention Imbalance in Recursive Settings

7.8.1 Perceptual Override

Image tokens are fixed while textual reasoning accumulates. The ratio shifts monotonically toward textual dominance across recursive cycles.

7.8.2 Reasoning Dominance Over Sensory Grounding

If reasoning about images already exists in context, the model shortcuts visual processing. Recursive cycles amplify this.

7.8.3 Recursive Linguistic Self-Conditioning

In text-only reasoning, parametric knowledge creates friction. In multimodal settings, no parametric knowledge exists about specific images. Textual assertions are more "readable" than compressed visual embeddings, creating asymmetry that recursion amplifies.

8. AI Safety Implications

8.1 Universal Attack Surface

Thought injection via prefilling requires only API-level access and bypasses all weight-level safety training.

8.2 Recursive Reasoning Safety

8.3 Multimodal Grounding Failures

Safety-critical VLM deployments (medical imaging, autonomous driving) are vulnerable: recursive structure amplifies text-over-vision dominance.

8.4 Adversarial Recursive Cognition

Carefully crafted initial reasoning exploits recursive architectures to progressively strengthen adversarial outputs.

8.5 Cognitive Hijacking Attacks

Detailed prefill + recursive "verification" = outputs appearing highly deliberated while entirely adversarially determined.

8.6 Defensive Recommendations

9. Discussion

9.1 Visible CoT as Universal Working Memory

The context window functions as mutable working memory. Recursive architectures elaborate this into complex cognitive structure that is both beneficial and harmful depending on context integrity.

9.2 The Detail Paradox

Detailed reasoning is both the hallmark of good cognition and the most effective manipulation vector—the detail paradox.

9.3 Recursive Architectures: Not a Defense

Recursive architectures defend against mild corruption but amplify severe corruption. They should not be treated as defensive mechanisms.

9.4 Implications for Mechanistic Interpretability

CORT creates observable "decision points" where models enter correction vs. reinforcement modes—potential interpretability targets.

10. Limitations

11. Ethics Statement

This work identifies vulnerabilities to enable defensive research. We do not release adversarial templates targeting safety systems. Understanding is essential for building robust systems.

12. Conclusion

Recursive reasoning architectures dynamically negotiate between correction, reinforcement, recovery, and stabilization. This creates a fundamental duality between cognitive resilience and cognitive lock-in requiring new architectural defenses.

References

Abdelnabi, S., Hasan, S., & Fritz, M. (2023). Compromising LLM-integrated applications with indirect prompt injection. AISec 2023.

Bai, J., et al. (2025). Qwen2.5-VL technical report. arXiv:2502.13923.

DeepSeek (2025). DeepSeek-R1: Incentivizing reasoning via RL. arXiv:2501.12948.

Greshake, K., et al. (2023). Indirect prompt injection. arXiv:2302.12173.

Kojima, T., et al. (2022). Large language models are zero-shot reasoners. NeurIPS 2022.

Lanham, T., et al. (2023). Measuring faithfulness in CoT reasoning. arXiv:2307.13702.

Liu, H., et al. (2024). Visual instruction tuning. NeurIPS 2024.

Madaan, A., et al. (2023). Self-Refine: Iterative refinement with self-feedback. NeurIPS 2023.

OpenAI (2024). Learning to reason with LLMs. OpenAI Blog.

Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS 2023.

Snell, C., et al. (2024). Scaling LLM test-time compute. arXiv:2408.03314.

Turner, A., et al. (2023). Activation addition. arXiv:2308.10248.

Turpin, M., et al. (2024). Language models don't always say what they think. NeurIPS 2024.

Wang, X., et al. (2023). Self-consistency improves CoT reasoning. ICLR 2023.

Wang, W., et al. (2024). InternVL. CVPR 2024.

Wei, J., et al. (2022). Chain-of-thought prompting elicits reasoning. NeurIPS 2022.

Yao, S., et al. (2023). Tree of thoughts. NeurIPS 2023.

Zou, A., et al. (2023). Representation engineering. arXiv:2310.01405.

Zou, A., et al. (2023b). Universal adversarial attacks on aligned LMs. arXiv:2307.15043.

Thought Injection and Recursive Cognitive Self-Correction: Steering and Recovering Reasoning in Autoregressive Transformers via Visible Chain-of-Thought Manipulation