Large language models present a fundamental paradox: they exhibit remarkable fluency at scale yet collapse under precision. This paper argues that the perceived smoothness of LLM intelligence is a resolution artifact — a low-fidelity projection of an inherently discrete, approximation-bound architecture. Drawing on foundational ML theory, empirical benchmark evidence, and the epistemological literature on emergence, we decompose the gap between the ideal curve (genuine continuous reasoning) and the discrete steps that LLMs actually execute. We further examine the operational implications for practitioners in high-stakes analytical domains — particularly macroeconomic research and systematic trading. The central claim is simple and uncomfortable: the tool is brilliant, and the tool is broken, and these two facts are not in conflict.
1. The Magnification Problem
The visual metaphor in our cover image is not decorative — it is architectural. A staircase of stone blocks ascends from left to right. At distance, with the naked eye, it reads as a smooth diagonal line. Introduce a magnifying lens, zoom to 10×, and the reality becomes impossible to ignore: discrete steps, hard edges, approximation error annotated in the margin, and a label that cuts to the epistemological core — emergence is an artifact of resolution.
This is not a metaphor borrowed loosely from mathematics. It is a precise description of how intelligence is currently manufactured at industrial scale. Every large language model in production today — GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, Llama 3 — is, at its computational foundation, a discrete approximation engine operating on a finite vocabulary, processing tokens one distribution at a time, producing outputs that feel continuous but are architecturally quantized.[1][2]
"Smoothness is not a property of the system. It is a property of the observer's distance from it."
The question this paper asks is not whether LLMs are powerful — they demonstrably are. The question is whether practitioners understand the structural constraints governing where that power terminates, and what happens in the approximation gap between the ideal curve they expect and the staircase they are actually standing on.
The dominant failure mode in AI deployment is not technical — it is epistemological. Users systematically overestimate model capability in proportion to the fluency of model output. Fluency is a function of training scale, not a reliable proxy for reasoning depth.
2. Architecture of the Staircase
2.1 The Discreteness Problem
The Transformer architecture, introduced by Vaswani et al. in 2017, fundamentally operates on sequences of discrete tokens drawn from a fixed vocabulary — typically 50,000 to 128,000 items.[1] This tokenization is the first and hardest discretization boundary: before any computation begins, the continuous reality of language has already been quantized into a finite set of integer indices. What follows — self-attention, feed-forward projection, layer normalization, residual streaming — operates in high-dimensional continuous space. But the model cannot generate a token that does not exist in its vocabulary, cannot represent a nuance that has no token-level expression, cannot escape the grid imposed at the input boundary.[2]
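The input-boundary quantization described above can be made concrete in a few lines. The vocabulary and the greedy longest-match scheme below are invented for illustration; production tokenizers (BPE, WordPiece, SentencePiece) differ in detail, but the structural point is the same: every input must be expressed as integer indices into a fixed table, and anything off that grid is unrepresentable.

```python
# Toy illustration of the discretization boundary at the input.
# VOCAB and the greedy longest-match scheme are hypothetical, not any
# real tokenizer; real BPE/WordPiece tokenizers differ in detail.

VOCAB = {"the": 0, "rate": 1, "path": 2, "steep": 3, "en": 4, "s": 5, " ": 6}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization into integer indices."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            # Off-grid input: no vocabulary item covers this character.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("the rate path steepens"))  # → [0, 6, 1, 6, 2, 6, 3, 4, 5]
```

Note the failure mode: an input containing any character outside the vocabulary's coverage cannot be encoded at all. Real tokenizers avoid hard failure via byte-level fallbacks, but the grid itself remains.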
2.2 The Approximation Stack
Training a large language model is, in the most technically precise sense, a function approximation problem. Given the true distribution P(x) over natural language, the model attempts to learn a parameterized approximation P_θ(x) that minimizes cross-entropy loss over a training corpus. The gap between these two distributions — the approximation error — is not eliminable. It is reducible only asymptotically, through more parameters, more data, more compute.[3] This error is not uniform. In high-density regions of the training distribution — common English sentences, popular code patterns, frequently discussed concepts — the approximation is extremely tight. In sparse regions — niche technical domains, precise numerical reasoning, edge-case logical structures — the approximation degrades rapidly, often catastrophically. The staircase has smooth sections and rough sections. The smooth sections are where most benchmarks live.
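The non-uniformity of the approximation gap can be shown with the standard definitions. The distributions below are invented toy numbers: P stands for the true next-token distribution, Q for the model's fit, and the KL divergence D_KL(P ∥ Q) = H(P, Q) − H(P) is exactly the irreducible part of the cross-entropy loss discussed above.

```python
import math

# Toy sketch with invented numbers: the same model fit can be tight in a
# dense region of the training distribution and loose in a sparse one.

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P): the approximation gap."""
    entropy = -sum(px * math.log(px) for px in p if px > 0)
    return cross_entropy(p, q) - entropy

# Dense region: the model allocates mass almost exactly where the data does.
p_dense, q_dense = [0.7, 0.2, 0.1], [0.68, 0.21, 0.11]
# Sparse region: same support, but the fit is badly off.
p_sparse, q_sparse = [0.7, 0.2, 0.1], [0.35, 0.35, 0.30]

print(f"gap in dense region:  {kl(p_dense, q_dense):.4f} nats")
print(f"gap in sparse region: {kl(p_sparse, q_sparse):.4f} nats")
```

The sparse-region gap is two orders of magnitude larger here despite identical "true" distributions, which is the staircase's rough section in miniature.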
Dziri et al. (2023) demonstrate that GPT-4, tested on 3-digit × 3-digit multiplication, achieves approximately 59% accuracy — a task trivially solved by elementary arithmetic rules. Performance collapses exponentially as compositional complexity grows: models solve single-step operations but fail to compose them into correct reasoning paths.[4]
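The exponential collapse Dziri et al. report follows directly from error compounding. A toy model (illustrative numbers, not their data): if each primitive sub-step succeeds with probability p and a task requires n sub-steps composed without recovery, end-to-end accuracy decays as p^n.

```python
# Toy model of compositional collapse (illustrative, not Dziri et al.'s data):
# per-step accuracy compounds multiplicatively across a reasoning chain.

def composed_accuracy(p_step: float, n_steps: int) -> float:
    """Whole-task accuracy when n_steps must each succeed independently."""
    return p_step ** n_steps

for n in (1, 5, 10, 20):
    print(f"{n:2d} steps at 95% per step -> {composed_accuracy(0.95, n):.1%} end-to-end")
```

Even a seemingly strong 95% per-step rate falls below 40% end-to-end by twenty composed steps, which is why single-operation competence does not translate into multi-step reliability.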
2.3 The Confidence Illusion
Perhaps the most operationally dangerous property of LLMs is the near-constant quality of their surface fluency. Unlike a human expert, who visibly hesitates at the boundary of their knowledge, a language model produces text with structurally identical confidence regardless of whether it draws on densely trained knowledge or confabulates in a sparse region.[5] Xiong et al. (2024) demonstrate that LLMs, when asked to verbalize their confidence, systematically exhibit overconfidence — potentially imitating human patterns of expressing certainty — and that all investigated methods continue to struggle with challenging tasks, leaving significant scope for improvement in confidence elicitation.[6]
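The miscalibration Xiong et al. measure can be quantified with a standard expected-calibration-error (ECE) computation. The sample data below is invented for illustration: each pair is a verbalized confidence and whether the answer was actually correct, and the weighted gap between stated confidence and realized accuracy in each bin is the overconfidence the text describes.

```python
# Minimal expected-calibration-error (ECE) sketch on invented data.

def ece(samples, n_bins=5):
    """Weighted mean |stated confidence - realized accuracy| across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / len(samples)) * abs(avg_conf - accuracy)
    return err

# Invented sample: the model says "90% sure" far more often than it is right.
samples = [(0.9, True), (0.9, False), (0.9, False), (0.9, True),
           (0.9, False), (0.6, True), (0.6, False), (0.3, False)]
print(f"ECE = {ece(samples):.3f}")
```

An ECE of zero would mean stated confidence tracks accuracy exactly; the overconfident sample above scores 0.375, dominated by the 90%-confidence bin that is right only 40% of the time.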
3. Emergence as Resolution Artifact
Few concepts in contemporary AI discourse have been more consequential — and more systematically misunderstood — than emergence. The standard narrative, following Wei et al. (2022), describes certain capabilities as emergent because they appear abruptly at specific model scale thresholds, seemingly absent below and suddenly present above.[18] This framing carries strong implications: intelligence undergoes phase transitions; there exist non-linear scaling laws; the system is doing something categorically new at scale.
A 2023 paper by Schaeffer, Miranda, and Koyejo challenged this narrative with methodological precision.[8] Their central finding: apparent emergence is largely an artifact of evaluation metric choice. When researchers use metrics with sharp discontinuities — binary correct/incorrect, exact string match — small incremental improvements in underlying model capability produce sudden visible jumps at evaluation time. Switch to a smooth, continuous metric measuring the same capability, and the apparent emergence disappears entirely, replaced by a smooth, predictable linear curve. The emergence was in the measurement instrument, not in the model.
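The mechanism can be reproduced in a few lines. With invented numbers: suppose per-token accuracy p improves smoothly with scale, and the evaluation requires an exact match on an answer L tokens long, scoring roughly p^L. The underlying capability curve is linear; the metric makes it look like a phase transition.

```python
# Sketch of the metric-artifact mechanism with invented numbers: smooth
# per-token improvement, thresholded by an exact-match metric on a
# 10-token answer (length L is an assumption for illustration).

L = 10  # assumed answer length in tokens

def exact_match(p_token: float, length: int = L) -> float:
    """Probability every token in an L-token answer is correct."""
    return p_token ** length

scales  = [1, 2, 3, 4, 5, 6]                     # arbitrary scale index
p_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.97]   # smooth underlying gains

for s, p in zip(scales, p_token):
    print(f"scale {s}: per-token {p:.2f} (smooth) | exact-match {exact_match(p):.3f}")
```

The per-token column climbs in even steps; the exact-match column sits near zero for four scale points and then appears to "switch on" — emergence in the instrument, not the model.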
3.1 The Benchmark Flatness Problem
Modern LLM benchmarks — MMLU, HumanEval, BIG-Bench, HELM — produce aggregate scores: single numbers enabling comparison across models. This compression from high-dimensional performance space to a scalar is itself a form of low-resolution observation. A model scoring 90% on MMLU is not uniformly 90% capable. It may be 99% capable on high-school biology and 65% capable on formal logic. The aggregate masks the staircase.[9]
| Capability Domain | Distribution Density | Est. Accuracy (GPT-4) | Primary Failure Mode | Operational Status |
|---|---|---|---|---|
| Common factual recall | Extreme | 97–99% | Stale facts post-cutoff | ■ RELIABLE |
| Multi-step arithmetic | Moderate | 70–85% | Carry errors, digit confusion | ◆ VERIFY |
| Formal logical deduction | Low | 55–70% | Affirming the consequent | ◆ VERIFY |
| Novel code generation | Low–Moderate | 60–78% | Syntactically valid, semantically wrong | ◆ VERIFY |
| Causal inference | Very Low | 40–55% | Correlation → causation | ✕ DO NOT TRUST |
| Self-knowledge / calibration | Minimal | 30–50% | Confidently wrong self-reports | ✕ DO NOT TRUST |
| Truly novel analogy | Near-zero | 20–40% | Superficial pattern matching | ✕ DO NOT TRUST |
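The masking effect in the table above is, mechanically, just a mean hiding variance. A toy sketch (sub-domain names and scores invented for illustration):

```python
# Invented sub-domain scores: one aggregate number hides the staircase.

domain_scores = {
    "high_school_biology": 0.99,
    "world_history":       0.95,
    "formal_logic":        0.65,
    "college_math":        0.62,
}

aggregate = sum(domain_scores.values()) / len(domain_scores)
spread = max(domain_scores.values()) - min(domain_scores.values())
print(f"aggregate {aggregate:.2%}, per-domain spread {spread:.0%}")
```

A headline score of roughly 80% coexists with a 37-point spread between best and worst sub-domains; acting on the scalar alone means acting on the smooth curve rather than the steps.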
4. The Ideal Curve That Doesn't Exist
The image's most haunting annotation is the simplest: ideal curve, written twice — once inside the magnifying lens, once outside it. The ideal curve is the same in both views. What changes is what surrounds it. Inside the lens, the staircase approximation is visible. Outside it, the approximation is invisible. The ideal curve — what we actually want from intelligence — persists as an aspiration that the current architecture can approach but never inhabit.
What would a system actually on the ideal curve look like? It would perform genuine symbolic manipulation — not statistical pattern completion. It would maintain consistent beliefs across a conversation and know when those beliefs were uncertain. It would distinguish between what it knows and what it is inferring. It would refuse confidently rather than confabulate smoothly when operating outside its knowledge boundary.[10]
None of these properties emerge reliably from scale alone. Scaling laws — the empirical observation that model performance improves predictably with compute, data, and parameters — describe improvement along the approximation curve. They do not describe approach toward the ideal curve. These are different trajectories.[11]
Scaling gets you closer to the optimal approximation of your training distribution. It does not, by itself, change the nature of what is being approximated. The architecture ceiling is not a compute problem. It is a representational problem.
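The distinction between the two trajectories can be written down. The sketch below uses a power law with an irreducible floor, in the form of the published Chinchilla parametric fit; the constants are borrowed for shape only and should be treated as illustrative, not predictive. Loss approaches L_inf > 0 as parameters grow, whereas the ideal curve would drive loss toward zero.

```python
# Illustrative scaling curve: L(N) = L_inf + a * N**(-alpha).
# Constants echo the form of the Chinchilla fit (Hoffmann et al., 2022);
# treat them as illustrative, not as a forecast for any specific model.

def scaling_loss(n_params: float, l_inf=1.69, a=406.4, alpha=0.34) -> float:
    """Training loss as a function of parameter count, with a hard floor."""
    return l_inf + a * n_params ** (-alpha)

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"N = {n:.0e}: loss ~ {scaling_loss(n):.3f} (floor = 1.69)")
```

Each order of magnitude buys a smaller improvement, and no amount of N takes the curve below the floor: improvement along the approximation curve, not approach to the ideal one.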
4.1 The Current Research Frontier
Chain-of-thought prompting (Wei et al., 2022) partially externalizes reasoning steps, reducing the cognitive load on single forward passes.[7] Tool use and retrieval augmentation extend the model's effective knowledge without addressing its reasoning architecture. Constitutional AI and RLHF improve calibration on known failure modes but do not eliminate the underlying approximation structure.[12] More fundamental approaches — neurosymbolic integration, formal verification of model outputs, mechanistic interpretability — remain early-stage research with no clear path to production deployment at LLM scale.[13][14] The ideal curve remains, for now, a theoretical boundary condition rather than an engineering target.
5. Implications for Macro Research & Trading
5.1 What LLMs Do Well in This Domain
Large language models are genuinely powerful for text-dense, pattern-rich tasks that sit in high-distribution-density regions: summarizing central bank communications, synthesizing cross-language financial news, generating structured hypothesis frameworks from verbal briefings, identifying surface-level rhetorical shifts in earnings calls. These are real productivity gains with measurable value.[15]
5.2 Where the Staircase Becomes Dangerous
The failure modes are precisely located. Numerical reasoning — rate path arithmetic, duration calculations, option payoff structures — sits in low-distribution-density territory. Causal inference — "if the Fed hikes here, does the yield curve steepen or flatten, and why?" — requires exactly the kind of robust counterfactual reasoning that LLMs systematically fail to produce reliably. Self-reported uncertainty — "how confident are you in this analysis?" — is actively misleading, as the model's confidence signal does not track epistemic accuracy.[5][6]
5.3 Operational Calibration Protocol
The correct operational stance is neither uncritical adoption nor reflexive rejection — it is structured integration with explicit failure zone mapping. Use LLMs where distribution density is high and verification is cheap. Treat LLM outputs in low-density zones as first-pass hypotheses requiring independent verification. Never route numerical calculations, causal inferences, or probability estimates through an LLM without independent validation.
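The protocol above can be encoded as an explicit routing table. The task taxonomy and zone labels below are hypothetical, offered as a sketch of the structure rather than a production policy:

```python
# Hypothetical encoding of the calibration protocol: every task category
# carries an explicit density estimate and a required handling rule.
# Categories and labels are illustrative, not a production taxonomy.

ZONE = {  # task -> (distribution density, required handling)
    "summarize_fed_statement":  ("high",     "use"),
    "multi_step_arithmetic":    ("moderate", "verify"),
    "causal_inference":         ("low",      "independent_validation"),
    "self_reported_confidence": ("minimal",  "do_not_trust"),
}

def route(task: str) -> str:
    """Return the handling rule; unknown tasks fail loudly by design."""
    density, handling = ZONE[task]
    return handling

print(route("causal_inference"))  # → independent_validation
```

The design point is that the mapping is explicit and fails loudly on unmapped tasks, rather than defaulting every query to trust.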
6. The Epistemology of Roughness
There is a deeper argument beneath the technical one. The demand for smooth intelligence is itself a cognitive bias — a preference for legibility over accuracy that pervades not just AI evaluation but market analysis, macroeconomic forecasting, and institutional risk management.
Markets are not smooth. Economic causation is not smooth. Policy transmission is not smooth. The real world is constituted by exactly the kind of discrete, edge-laden, approximation-bound structures that make us uncomfortable when we encounter them in AI systems. We build smooth models of rough realities because smooth models are tractable. We forget that the tractability is ours, not the world's.[16]
The magnifying lens in our cover image is therefore a tool of epistemic hygiene. It does not reveal something wrong with the staircase. The staircase is the honest representation. The smooth curve is the convenient fiction. What the lens does is prevent us from confusing the fiction for the fact — a confusion that, in financial markets, carries asymmetric consequences.
7. Conclusions: What Honesty Requires
This paper has advanced a chain of claims. First, LLMs are discrete approximation systems — their apparent smoothness is a product of high parameter density in high-training-density regions, not of genuine continuous reasoning capability. Second, the approximation error is non-uniform and unsignaled: the model does not reliably indicate when it is operating in sparse territory. Third, emergent capabilities are significantly — though not entirely — measurement artifacts, driven by discontinuous evaluation metrics rather than genuine architectural phase transitions. Fourth, practitioners in precision-dependent domains must map the staircase explicitly: identifying which tasks sit in reliable zones and which require independent verification before any output is acted upon.
None of this is cause for abandonment. The capabilities are real. The productivity gains are real. The value of having a broadly competent, infinitely patient, multilingual analytical assistant available at near-zero marginal cost is genuinely transformative. These facts coexist with the structural limitations described here.
What honesty requires is that we stop choosing our observation distance based on what we want to see. The ideal curve is a useful design target. The staircase is the current reality. The magnifying glass — the discipline of high-resolution evaluation — is not pessimism. It is the precondition for using these systems well.
Final Claim: Intelligence at low resolution is not intelligence. It is compression. Treating compression as cognition is the fundamental error of our current moment in AI deployment. Correction requires not less enthusiasm for these tools, but more precision about what they are.
Intelligence is smooth only at low resolution.
— ZTRADER AI RESEARCH · SEE THE STRUCTURE · 洞若观火