Large language models present a fundamental paradox: they exhibit remarkable fluency at scale yet collapse under precision. This paper argues that the perceived smoothness of LLM intelligence is a resolution artifact — a low-fidelity projection of an inherently discrete, approximation-bound architecture. Drawing on foundational ML theory, empirical benchmark evidence, and the epistemological literature on emergence, we decompose the gap between the ideal curve (genuine continuous reasoning) and the discrete steps that LLMs actually execute. We further examine the operational implications for practitioners in high-stakes analytical domains — particularly macroeconomic research and systematic trading. The central claim is simple and uncomfortable: the tool is brilliant, and the tool is broken, and these two facts are not in conflict.
1. The Magnification Problem
The visual metaphor in our cover image is not decorative — it is architectural. A staircase of stone blocks ascends from left to right. At distance, with the naked eye, it reads as a smooth diagonal line. Introduce a magnifying lens, zoom to 10×, and the reality becomes impossible to ignore: discrete steps, hard edges, approximation error annotated in the margin, and a label that cuts to the epistemological core — emergence is an artifact of resolution.
This is not a metaphor borrowed loosely from mathematics. It is a precise description of how intelligence is currently manufactured at industrial scale. Every large language model in production today — GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, Llama 3 — is, at its computational foundation, a discrete approximation engine operating on a finite vocabulary, processing tokens one distribution at a time, producing outputs that feel continuous but are architecturally quantized.[1][2]
"Smoothness is not a property of the system. It is a property of the observer's distance from it."
The question this paper asks is not whether LLMs are powerful — they demonstrably are. The question is whether practitioners understand the structural constraints governing where that power terminates, and what happens in the approximation gap between the ideal curve they expect and the staircase they are actually standing on.
The dominant failure mode in AI deployment is not technical — it is epistemological. Users systematically overestimate model capability in proportion to the fluency of model output. Fluency is a function of training scale, not a reliable proxy for reasoning depth.
2. Architecture of the Staircase
2.1 The Discreteness Problem
The Transformer architecture, introduced by Vaswani et al. in 2017, fundamentally operates on sequences of discrete tokens drawn from a fixed vocabulary — typically 50,000 to 128,000 items.[1] This tokenization is the first and hardest discretization boundary: before any computation begins, the continuous reality of language has already been quantized into a finite set of integer indices. What follows — self-attention, feed-forward projection, layer normalization, residual streaming — operates in high-dimensional continuous space. But the model cannot generate a token that does not exist in its vocabulary, cannot represent a nuance that has no token-level expression, cannot escape the grid imposed at the input boundary.[2]
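The input-boundary quantization described above can be made concrete in a few lines. The vocabulary and the greedy longest-match scheme below are invented for illustration; production tokenizers (BPE, WordPiece, SentencePiece) differ in detail, but the structural point is the same: every input must be expressed as integer indices into a fixed table, and anything off that grid is unrepresentable.

```python
# Toy illustration of the discretization boundary at the input.
# VOCAB and the greedy longest-match scheme are hypothetical, not any
# real tokenizer; real BPE/WordPiece tokenizers differ in detail.

VOCAB = {"the": 0, "rate": 1, "path": 2, "steep": 3, "en": 4, "s": 5, " ": 6}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization into integer indices."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            # Off-grid input: no vocabulary item covers this character.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(tokenize("the rate path steepens"))  # → [0, 6, 1, 6, 2, 6, 3, 4, 5]
```

Note the failure mode: an input containing any character outside the vocabulary's coverage cannot be encoded at all. Real tokenizers avoid hard failure via byte-level fallbacks, but the grid itself remains.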
2.2 The Approximation Stack
Training a large language model is, in the most technically precise sense, a function approximation problem. Given the true distribution P(x) over natural language, the model attempts to learn a parameterized approximation P_θ(x) that minimizes cross-entropy loss over a training corpus. The gap between these two distributions — the approximation error — is not eliminable. It is reducible only asymptotically, through more parameters, more data, more compute.[3] This error is not uniform. In high-density regions of the training distribution — common English sentences, popular code patterns, frequently discussed concepts — the approximation is extremely tight. In sparse regions — niche technical domains, precise numerical reasoning, edge-case logical structures — the approximation degrades rapidly, often catastrophically. The staircase has smooth sections and rough sections. The smooth sections are where most benchmarks live.
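The non-uniformity of the approximation gap can be shown with the standard definitions. The distributions below are invented toy numbers: P stands for the true next-token distribution, Q for the model's fit, and the KL divergence D_KL(P ∥ Q) = H(P, Q) − H(P) is exactly the irreducible part of the cross-entropy loss discussed above.

```python
import math

# Toy sketch with invented numbers: the same model fit can be tight in a
# dense region of the training distribution and loose in a sparse one.

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P): the approximation gap."""
    entropy = -sum(px * math.log(px) for px in p if px > 0)
    return cross_entropy(p, q) - entropy

# Dense region: the model allocates mass almost exactly where the data does.
p_dense, q_dense = [0.7, 0.2, 0.1], [0.68, 0.21, 0.11]
# Sparse region: same support, but the fit is badly off.
p_sparse, q_sparse = [0.7, 0.2, 0.1], [0.35, 0.35, 0.30]

print(f"gap in dense region:  {kl(p_dense, q_dense):.4f} nats")
print(f"gap in sparse region: {kl(p_sparse, q_sparse):.4f} nats")
```

The sparse-region gap is two orders of magnitude larger here despite identical "true" distributions, which is the staircase's rough section in miniature.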
Dziri et al. (2023) demonstrate that GPT-4, tested on 3-digit × 3-digit multiplication, achieves approximately 59% accuracy — a task trivially solved by elementary arithmetic rules. Performance collapses exponentially as compositional complexity grows: models solve single-step operations but fail to compose them into correct reasoning paths.[4]
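The exponential collapse Dziri et al. report follows directly from error compounding. A toy model (illustrative numbers, not their data): if each primitive sub-step succeeds with probability p and a task requires n sub-steps composed without recovery, end-to-end accuracy decays as p^n.

```python
# Toy model of compositional collapse (illustrative, not Dziri et al.'s data):
# per-step accuracy compounds multiplicatively across a reasoning chain.

def composed_accuracy(p_step: float, n_steps: int) -> float:
    """Whole-task accuracy when n_steps must each succeed independently."""
    return p_step ** n_steps

for n in (1, 5, 10, 20):
    print(f"{n:2d} steps at 95% per step -> {composed_accuracy(0.95, n):.1%} end-to-end")
```

Even a seemingly strong 95% per-step rate falls below 40% end-to-end by twenty composed steps, which is why single-operation competence does not translate into multi-step reliability.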
2.3 The Confidence Illusion
Perhaps the most operationally dangerous property of LLMs is the near-constant quality of their surface fluency. Unlike a human expert, who visibly hesitates at the boundary of their knowledge, a language model produces text with structurally identical confidence regardless of whether it draws on densely trained knowledge or confabulates in a sparse region.[5] Xiong et al. (2024) demonstrate that LLMs, when asked to verbalize their confidence, systematically exhibit overconfidence — potentially imitating human patterns of expressing certainty — and that all investigated methods continue to struggle with challenging tasks, leaving significant scope for improvement in confidence elicitation.[6]
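The miscalibration Xiong et al. measure can be quantified with a standard expected-calibration-error (ECE) computation. The sample data below is invented for illustration: each pair is a verbalized confidence and whether the answer was actually correct, and the weighted gap between stated confidence and realized accuracy in each bin is the overconfidence the text describes.

```python
# Minimal expected-calibration-error (ECE) sketch on invented data.

def ece(samples, n_bins=5):
    """Weighted mean |stated confidence - realized accuracy| across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / len(samples)) * abs(avg_conf - accuracy)
    return err

# Invented sample: the model says "90% sure" far more often than it is right.
samples = [(0.9, True), (0.9, False), (0.9, False), (0.9, True),
           (0.9, False), (0.6, True), (0.6, False), (0.3, False)]
print(f"ECE = {ece(samples):.3f}")
```

An ECE of zero would mean stated confidence tracks accuracy exactly; the overconfident sample above scores 0.375, dominated by the 90%-confidence bin that is right only 40% of the time.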
3. Emergence as Resolution Artifact
Few concepts in contemporary AI discourse have been more consequential — and more systematically misunderstood — than emergence. The standard narrative, following Wei et al. (2022), describes certain capabilities as emergent because they appear abruptly at specific model scale thresholds, seemingly absent below and suddenly present above.[18] This framing carries strong implications: intelligence undergoes phase transitions; there exist non-linear scaling laws; the system is doing something categorically new at scale.
A 2023 paper by Schaeffer, Miranda, and Koyejo challenged this narrative with methodological precision.[8] Their central finding: apparent emergence is largely an artifact of evaluation metric choice. When researchers use metrics with sharp discontinuities — binary correct/incorrect, exact string match — small incremental improvements in underlying model capability produce sudden visible jumps at evaluation time. Switch to a smooth, continuous metric measuring the same capability, and the apparent emergence disappears entirely, replaced by a smooth, predictable linear curve. The emergence was in the measurement instrument, not in the model.
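The mechanism can be reproduced in a few lines. With invented numbers: suppose per-token accuracy p improves smoothly with scale, and the evaluation requires an exact match on an answer L tokens long, scoring roughly p^L. The underlying capability curve is linear; the metric makes it look like a phase transition.

```python
# Sketch of the metric-artifact mechanism with invented numbers: smooth
# per-token improvement, thresholded by an exact-match metric on a
# 10-token answer (length L is an assumption for illustration).

L = 10  # assumed answer length in tokens

def exact_match(p_token: float, length: int = L) -> float:
    """Probability every token in an L-token answer is correct."""
    return p_token ** length

scales  = [1, 2, 3, 4, 5, 6]                     # arbitrary scale index
p_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.97]   # smooth underlying gains

for s, p in zip(scales, p_token):
    print(f"scale {s}: per-token {p:.2f} (smooth) | exact-match {exact_match(p):.3f}")
```

The per-token column climbs in even steps; the exact-match column sits near zero for four scale points and then appears to "switch on" — emergence in the instrument, not the model.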
3.1 The Benchmark Flatness Problem
Modern LLM benchmarks — MMLU, HumanEval, BIG-Bench, HELM — produce aggregate scores: single numbers enabling comparison across models. This compression from high-dimensional performance space to a scalar is itself a form of low-resolution observation. A model scoring 90% on MMLU is not uniformly 90% capable. It may be 99% capable on high-school biology and 65% capable on formal logic. The aggregate masks the staircase.[9]
| Capability Domain | Distribution Density | Est. Accuracy (GPT-4) | Primary Failure Mode | Operational Status |
|---|---|---|---|---|
| Common factual recall | Extreme | 97–99% | Stale facts post-cutoff | ■ RELIABLE |
| Multi-step arithmetic | Moderate | 70–85% | Carry errors, digit confusion | ◆ VERIFY |
| Formal logical deduction | Low | 55–70% | Affirming the consequent | ◆ VERIFY |
| Novel code generation | Low–Moderate | 60–78% | Syntactically valid, semantically wrong | ◆ VERIFY |
| Causal inference | Very Low | 40–55% | Correlation → causation | ✕ DO NOT TRUST |
| Self-knowledge / calibration | Minimal | 30–50% | Confidently wrong self-reports | ✕ DO NOT TRUST |
| Truly novel analogy | Near-zero | 20–40% | Superficial pattern matching | ✕ DO NOT TRUST |
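The masking effect in the table above is, mechanically, just a mean hiding variance. A toy sketch (sub-domain names and scores invented for illustration):

```python
# Invented sub-domain scores: one aggregate number hides the staircase.

domain_scores = {
    "high_school_biology": 0.99,
    "world_history":       0.95,
    "formal_logic":        0.65,
    "college_math":        0.62,
}

aggregate = sum(domain_scores.values()) / len(domain_scores)
spread = max(domain_scores.values()) - min(domain_scores.values())
print(f"aggregate {aggregate:.2%}, per-domain spread {spread:.0%}")
```

A headline score of roughly 80% coexists with a 37-point spread between best and worst sub-domains; acting on the scalar alone means acting on the smooth curve rather than the steps.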
4. The Ideal Curve That Doesn't Exist
The image's most haunting annotation is the simplest: ideal curve, written twice — once inside the magnifying lens, once outside it. The ideal curve is the same in both views. What changes is what surrounds it. Inside the lens, the staircase approximation is visible. Outside it, the approximation is invisible. The ideal curve — what we actually want from intelligence — persists as an aspiration that the current architecture can approach but never inhabit.
What would a system actually on the ideal curve look like? It would perform genuine symbolic manipulation — not statistical pattern completion. It would maintain consistent beliefs across a conversation and know when those beliefs were uncertain. It would distinguish between what it knows and what it is inferring. It would refuse confidently rather than confabulate smoothly when operating outside its knowledge boundary.[10]
None of these properties emerge reliably from scale alone. Scaling laws — the empirical observation that model performance improves predictably with compute, data, and parameters — describe improvement along the approximation curve. They do not describe approach toward the ideal curve. These are different trajectories.[11]
Scaling gets you closer to the optimal approximation of your training distribution. It does not, by itself, change the nature of what is being approximated. The architecture ceiling is not a compute problem. It is a representational problem.
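The distinction between the two trajectories can be written down. The sketch below uses a power law with an irreducible floor, in the form of the published Chinchilla parametric fit; the constants are borrowed for shape only and should be treated as illustrative, not predictive. Loss approaches L_inf > 0 as parameters grow, whereas the ideal curve would drive loss toward zero.

```python
# Illustrative scaling curve: L(N) = L_inf + a * N**(-alpha).
# Constants echo the form of the Chinchilla fit (Hoffmann et al., 2022);
# treat them as illustrative, not as a forecast for any specific model.

def scaling_loss(n_params: float, l_inf=1.69, a=406.4, alpha=0.34) -> float:
    """Training loss as a function of parameter count, with a hard floor."""
    return l_inf + a * n_params ** (-alpha)

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"N = {n:.0e}: loss ~ {scaling_loss(n):.3f} (floor = 1.69)")
```

Each order of magnitude buys a smaller improvement, and no amount of N takes the curve below the floor: improvement along the approximation curve, not approach to the ideal one.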
4.1 The Current Research Frontier
Chain-of-thought prompting (Wei et al., 2022) partially externalizes reasoning steps, reducing the cognitive load on single forward passes.[7] Tool use and retrieval augmentation extend the model's effective knowledge without addressing its reasoning architecture. Constitutional AI and RLHF improve calibration on known failure modes but do not eliminate the underlying approximation structure.[12] More fundamental approaches — neurosymbolic integration, formal verification of model outputs, mechanistic interpretability — remain early-stage research with no clear path to production deployment at LLM scale.[13][14] The ideal curve remains, for now, a theoretical boundary condition rather than an engineering target.
5. Implications for Macro Research & Trading
5.1 What LLMs Do Well in This Domain
Large language models are genuinely powerful for text-dense, pattern-rich tasks that sit in high-distribution-density regions: summarizing central bank communications, synthesizing cross-language financial news, generating structured hypothesis frameworks from verbal briefings, identifying surface-level rhetorical shifts in earnings calls. These are real productivity gains with measurable value.[15]
5.2 Where the Staircase Becomes Dangerous
The failure modes are precisely located. Numerical reasoning — rate path arithmetic, duration calculations, option payoff structures — sits in low-distribution-density territory. Causal inference — "if the Fed hikes here, does the yield curve steepen or flatten, and why?" — requires exactly the kind of robust counterfactual reasoning that LLMs systematically fail to produce reliably. Self-reported uncertainty — "how confident are you in this analysis?" — is actively misleading, as the model's confidence signal does not track epistemic accuracy.[5][6]
5.3 Operational Calibration Protocol
The correct operational stance is neither uncritical adoption nor reflexive rejection — it is structured integration with explicit failure zone mapping. Use LLMs where distribution density is high and verification is cheap. Treat LLM outputs in low-density zones as first-pass hypotheses requiring independent verification. Never route numerical calculations, causal inferences, or probability estimates through an LLM without independent validation.
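The protocol above can be encoded as an explicit routing table. The task taxonomy and zone labels below are hypothetical, offered as a sketch of the structure rather than a production policy:

```python
# Hypothetical encoding of the calibration protocol: every task category
# carries an explicit density estimate and a required handling rule.
# Categories and labels are illustrative, not a production taxonomy.

ZONE = {  # task -> (distribution density, required handling)
    "summarize_fed_statement":  ("high",     "use"),
    "multi_step_arithmetic":    ("moderate", "verify"),
    "causal_inference":         ("low",      "independent_validation"),
    "self_reported_confidence": ("minimal",  "do_not_trust"),
}

def route(task: str) -> str:
    """Return the handling rule; unknown tasks fail loudly by design."""
    density, handling = ZONE[task]
    return handling

print(route("causal_inference"))  # → independent_validation
```

The design point is that the mapping is explicit and fails loudly on unmapped tasks, rather than defaulting every query to trust.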
6. The Epistemology of Roughness
There is a deeper argument beneath the technical one. The demand for smooth intelligence is itself a cognitive bias — a preference for legibility over accuracy that pervades not just AI evaluation but market analysis, macroeconomic forecasting, and institutional risk management.
Markets are not smooth. Economic causation is not smooth. Policy transmission is not smooth. The real world is constituted by exactly the kind of discrete, edge-laden, approximation-bound structures that make us uncomfortable when we encounter them in AI systems. We build smooth models of rough realities because smooth models are tractable. We forget that the tractability is ours, not the world's.[16]
The magnifying lens in our cover image is therefore a tool of epistemic hygiene. It does not reveal something wrong with the staircase. The staircase is the honest representation. The smooth curve is the convenient fiction. What the lens does is prevent us from confusing the fiction for the fact — a confusion that, in financial markets, carries asymmetric consequences.
7. Conclusions: What Honesty Requires
This paper has advanced a chain of claims. First, LLMs are discrete approximation systems — their apparent smoothness is a product of high parameter density in high-training-density regions, not of genuine continuous reasoning capability. Second, the approximation error is non-uniform and unsignaled: the model does not reliably indicate when it is operating in sparse territory. Third, emergent capabilities are significantly — though not entirely — measurement artifacts, driven by discontinuous evaluation metrics rather than genuine architectural phase transitions. Fourth, practitioners in precision-dependent domains must map the staircase explicitly: identifying which tasks sit in reliable zones and which require independent verification before any output is acted upon.
None of this is cause for abandonment. The capabilities are real. The productivity gains are real. The value of having a broadly competent, infinitely patient, multilingual analytical assistant available at near-zero marginal cost is genuinely transformative. These facts coexist with the structural limitations described here.
What honesty requires is that we stop choosing our observation distance based on what we want to see. The ideal curve is a useful design target. The staircase is the current reality. The magnifying glass — the discipline of high-resolution evaluation — is not pessimism. It is the precondition for using these systems well.
Final Claim: Intelligence at low resolution is not intelligence. It is compression. Treating compression as cognition is the fundamental error of our current moment in AI deployment. Correction requires not less enthusiasm for these tools, but more precision about what they are.
Intelligence is smooth only at low resolution.
— ZTRADER AI RESEARCH · SEE THE STRUCTURE · 洞若观火