Perspective · Language & Cognitive Science

We Speak Through Shared Worlds

A Rate–Distortion View of Human Language

Status: Conceptual note — preprint draft
Subject areas: Psycholinguistics · Information theory · Language evolution
Keywords: rate–distortion · collaborative compression · speech information rate · common ground · channel capacity
Format: Nature Perspectives style · ~2,800 words

Abstract

Cross-linguistic research shows that languages with higher information density per syllable tend to be spoken more slowly, yielding a broadly similar speech information rate—approximately 39 bits per second—across 17 typologically diverse languages. The tempting interpretation is that this convergence reflects a hard communicative ceiling, a human analogue of Shannon channel capacity. We argue this overstates the evidence. The convergence is better understood as evidence that human languages inhabit a context-adaptive regime of collaborative compression, in which utterance rate, redundancy, and ambiguity are jointly tuned to support low-distortion meaning reconstruction under finite real-time inferential capacity. We develop a context-indexed rate–distortion formalism to capture this, distinguish three interacting families of constraint (physical, parsing, and grounding), and situate the framework across evolutionary, developmental, and interactional timescales. Four empirical predictions follow. The central open challenge is operationalising the context-sensitive distortion function. On this view, common ground is not social background to communication but part of its compression machinery: we do not merely speak in language; we speak through shared worlds.

1 · The Convergence Result and Its Interpretation

A striking cross-linguistic finding has emerged from quantitative corpus work: languages with higher information density—measured as conditional entropy per syllable—are typically spoken more slowly, while languages with lower density are spoken more quickly.1 Across 17 typologically diverse languages, these factors trade off to produce a broadly similar speech information rate of approximately 39 bits per second. The measured quantity is already a coupled product—conditional entropy per syllable multiplied by syllables per second—so the invariant appears not at the raw acoustic floor but at the point where temporal production constraints and contextual predictability interact.
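The arithmetic of the trade-off can be made concrete with a toy calculation. The figures below are illustrative round numbers, not the corpus estimates of Coupé et al.: a denser language spoken more slowly and a sparser language spoken more quickly land in the same neighbourhood of bits per second.

```python
# Toy illustration of the density-rate trade-off.
# The per-language figures are hypothetical, chosen only to show the coupling.
languages = {
    # name: (bits per syllable, syllables per second)
    "dense-slow":  (7.0, 5.6),
    "sparse-fast": (5.0, 7.8),
}

for name, (bits_per_syll, syll_per_sec) in languages.items():
    rate = bits_per_syll * syll_per_sec  # speech information rate, bits/s
    print(f"{name}: {rate:.1f} bits/s")  # both land near ~39 bits/s
```

The measured invariant is exactly this product, which is why it sits at the interaction of predictability (bits per syllable) and production tempo (syllables per second) rather than at either factor alone.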

The tempting interpretation is that this reflects a hard communicative ceiling, a human analogue of Shannon channel capacity: languages cannot exceed what the auditory and neural system can process. But that overstates what the evidence shows. Coupé et al. themselves describe the convergence as a soft constraint and metaphorise languages as occupying a fitness valley—a preferred operating region rather than a knife-edge maximum.1 Slack is structural. Languages appear to leave room for noise, repair, ambiguity, and interactional adjustment.

We propose that the right framework for understanding this convergence is not a throughput bound but a context-indexed rate–distortion regime. On this view, human languages may converge not on a fixed bit rate, but on a context-adaptive regime of collaborative compression, in which utterance rate, redundancy, and ambiguity are jointly tuned to support low-distortion meaning reconstruction under finite real-time inferential capacity.

Human communication may be better understood as a regime than as a rate.

2 · A Context-Indexed Rate–Distortion Formalism

Rate–distortion theory asks: what is the minimum rate at which a source can be encoded, given an acceptable level of distortion in the reconstruction?2,3 Applied to communication, the question becomes sharply concrete: in this context, how little do I need to say for you to recover enough of what I mean? A useful local formalism is:


R_x(D) = min_{p(u | m, x) : E[d_x(M, M̂)] ≤ D}  I(M; U | X = x)    (1)

Given context x, find the least informative signal that still lets the listener recover meaning within an acceptable margin of error.

where m is the intended meaning, u is the utterance, m̂ is the listener's reconstruction, and the minimization is taken over possible encoding strategies p(u | m, x)—the speaker's policy for choosing utterances given meaning and context. The distortion function d_x is context-indexed: it captures how bad a given mismatch between m and m̂ is in context x, which depends on stakes, common ground, task demands, and the anticipated cost of misunderstanding. Losing a hedge in casual conversation and losing a hedge in a surgical instruction are not equivalent distortions—they are governed by different d_x. Crucially, conditioning on X = x also reduces the mutual information term itself: when shared context is rich, the utterance must resolve less uncertainty about the intended meaning, so the minimum required signal is lower. This is the formal mechanism behind shared worlds as compression resources—common ground does not merely raise distortion tolerance; it reduces the signal burden directly.
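The context effect can be sketched numerically with the standard Blahut–Arimoto algorithm for rate–distortion. The sketch below is a minimal toy, not a model of any language: four hypothetical meanings, a 0–1 distortion, and two priors standing in for poor versus rich shared context. A peakier prior (richer common ground) yields a lower minimum rate at the same distortion trade-off.

```python
import numpy as np

def blahut_arimoto(p_m, d, beta, iters=500):
    """Toy rate-distortion computation via Blahut-Arimoto.
    p_m:  prior over meanings given the context
    d:    distortion matrix d[m, m_hat]
    beta: trade-off parameter (larger beta favours lower distortion)"""
    n = len(p_m)
    q = np.full(n, 1.0 / n)            # marginal over reconstructions
    for _ in range(iters):
        # optimal encoder given the current reconstruction marginal
        enc = q[None, :] * np.exp(-beta * d)
        enc /= enc.sum(axis=1, keepdims=True)
        q = p_m @ enc                  # re-derive the marginal
    rate = np.sum(p_m[:, None] * enc * np.log2(enc / q[None, :]))  # bits
    dist = np.sum(p_m[:, None] * enc * d)                          # expected distortion
    return rate, dist

# Hypothetical 4-meaning world; distortion 1 for any confusion, 0 otherwise.
d = 1.0 - np.eye(4)
poor = np.full(4, 0.25)                    # little shared context: flat prior
rich = np.array([0.85, 0.05, 0.05, 0.05])  # rich common ground: peaked prior

r_poor, _ = blahut_arimoto(poor, d, beta=4.0)
r_rich, _ = blahut_arimoto(rich, d, beta=4.0)
print(f"min rate, poor context: {r_poor:.2f} bits")
print(f"min rate, rich context: {r_rich:.2f} bits")  # strictly lower
```

The point of the sketch is only the direction of the effect in (1): richer context shrinks the uncertainty the utterance must resolve, so the minimum required signal drops before any distortion tolerance is adjusted.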

On this view, redundancy is not mere inefficiency. It is an adaptive resource that buys recoverability when context demands it, and can be relaxed when shared background makes compression safe.4,5 Experimental work in the noisy-channel tradition shows that listeners do precisely this: they combine the perceived utterance with prior expectations about plausible meanings, making rational inferences about intended content even when the signal is degraded.4 Efficient communication therefore appears to operate near a boundary where some ambiguity is tolerated—because full explicitness would be prohibitively costly—but repair remains available when compression overshoots. This distinguishes the present account from looser social-pragmatic frameworks, in which communicative success is often treated more globally: here the target is recoverable meaning under context-sensitive distortion—a notion that is both formally precise and empirically tractable.

That formalism, however, is only a snapshot. In actual conversation, context is not fixed but updated turn by turn through uptake, repair, and accumulating common ground. The full system requires a context-dynamics equation alongside (1):

x_{t+1} = F(x_t, u_t, m̂_t, r_t)    (2)

What we can safely mean next depends on what just happened between us—what was said, what was understood, and whether repair occurred.

No fully satisfying theory of F yet exists—this is not a small omission. A static context-indexed distortion function captures only the local geometry of the problem. The stronger claim must therefore remain modest: at each moment, speakers and listeners operate under a local rate–distortion tradeoff, while the context that defines that tradeoff is itself being jointly reshaped by the interaction. Equation (2) is where the real theory must eventually begin.

A plausible first constraint on F is that it is asymmetric: repair and misunderstanding likely update common ground more strongly than smooth success. A miscommunication that is noticed, corrected, and resolved leaves a durable trace—a newly established convention, a clarified referent, a sharpened mutual model—that subsequent smooth exchanges do not. This implies that early coordination failures are not merely costs to be minimised; they may accelerate later compression efficiency, increasing the gains available to interlocutors who persist through them.
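A minimal sketch of this asymmetry, with purely illustrative gain parameters (the update magnitudes are assumptions for exposition, not measured values), shows how a pair that repairs early failures can end up with more accumulated common ground than a pair that never miscommunicates:

```python
import random

def simulate(p_misunderstand, turns=200, smooth_gain=0.2, repair_gain=1.0, seed=0):
    """Toy asymmetric update rule for F in eq. (2): a noticed-and-repaired
    misunderstanding deposits more common ground than a smooth success.
    Gain parameters are illustrative assumptions."""
    rng = random.Random(seed)
    common_ground = 0.0
    for _ in range(turns):
        if rng.random() < p_misunderstand:
            common_ground += repair_gain   # repaired failure: durable trace
        else:
            common_ground += smooth_gain   # smooth success: small increment
    return common_ground

bumpy = simulate(p_misunderstand=0.3)   # frequent early failures, all repaired
smooth = simulate(p_misunderstand=0.0)  # never miscommunicates
print(f"bumpy pair:  {bumpy:.1f}")
print(f"smooth pair: {smooth:.1f}")     # less accumulated common ground
```

Under these assumed gains, persistence through repaired failures compounds: the bumpy pair's later exchanges can afford more compression because more shared structure has been banked.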

3 · Three Interacting Constraint Families

The communicative system is best decomposed not into serial processing stages but into three interacting families of constraints. These are mutually coupled pressures within a recurrent inferential system, not a feedforward pipeline.

Physical: timing · articulation · auditory resolution · acoustic noise · motor rhythm
Parsing: memory decay · chunking · incremental inference · predictive disambiguation · Now-or-Never bottleneck
Grounding: common ground · repair · ambiguity tolerance · stakes · partner model · task demands

The Now-or-Never bottleneck argument makes the recurrent nature of these constraints vivid: because incoming material decays rapidly, the brain must compress and recode input almost immediately, using prediction to resolve local ambiguity before the signal is gone.6 This makes top-down expectation—grounded in shared context—not a supplement to parsing but a structural necessity. Evidence on whether information is spread uniformly across the signal at the syntactic level remains mixed, however, with recent large-scale work finding limited predictive support for the strongest versions of that uniformity claim.8 There is growing evidence that expectations grounded in shared context shape parsing, and that parsing expectations in turn influence how incoming acoustic material is resolved, with both feeding back into subsequent production choices. The 39 bits/s convergence may then be one visible trace of a deeper equilibrium among these coupled pressures, not an isolated physical-layer fact.

4 · Nested Timescales and Uneven Evidence

This framework operates on nested timescales, and that nesting helps explain both why convergence might emerge and why the evidence for this account is currently uneven.

Proximate (interactional): Speakers and listeners adapt in real time—shortening references, exploiting common ground, calibrating repair as conversations unfold.
Developmental (acquisition): Speakers internalise predictive and inferential habits that make real-time adaptation possible.
Distal (evolutionary): Languages drift under selection pressures generated by countless episodes of actual use. Interaction is the proximate selection environment.

The timescales are not parallel but nested: interaction supplies the proximate selection environment within which languages slowly change. That nesting explains why empirical support is strongest at the interactional level, where adaptation is fastest and directly observable. The collaborative reference literature shows that speakers exploit shared history to shorten expression over time, that listeners keep pace, and critically that this compression is partner-specific—tracking accumulated common ground rather than mere repetition or practice effects.7

What remains untested is the stronger explanatory claim that cross-linguistic convergence in speech information rate is best understood through this broader regime of collaborative compression, rather than through narrower motor or auditory constraints alone. If that claim is right, one would expect cross-linguistic regularities not only in average throughput, but in how efficiently interlocutors use shared context to reduce rate across varying conditions of noise, intimacy, expertise, and stakes. If it is wrong, the observed convergence may be largely explicable by lower-level timing and perception constraints, with collaborative compression remaining an interactional phenomenon that does not scale up to explain language-wide patterns.

5 · Shared Worlds as Compression Resources

This framework treats human communication not as maximum-throughput transmission but as context-adaptive collaborative compression. Its central claim is that efficient communication operates not by eliminating ambiguity altogether, but by managing it within a regime where misunderstanding remains repairable. Misunderstanding is therefore not simply failure at the system's edge; it is part of the boundary condition that defines efficient use.

Four Main Predictions

P1 Compression within dialogue should be progressive and partner-specific, tracking accumulated common ground rather than mere repetition or familiarity.
P2 There should be a measurable phase transition between productive compression and overload, marked by sharply rising repair costs once compression exceeds what a listener can reliably reconstruct in real time.
P3 Cross-linguistic comparison should reveal systematic differences in how speech adapts across contexts—noise, expertise, intimacy, stakes—not merely convergence in average throughput.
P4 Listener-side limits on real-time inference should explain more adaptive variation in speech rate and redundancy than speaker-side motor constraints alone. If so, manipulating listener load—through cognitive dual-task, degraded input, or reduced shared context—should produce larger shifts in speaker behaviour than equivalent manipulations of speaker motor difficulty.

More broadly, the framework suggests that human communication is fundamentally relational rather than transmissive. Transmission models treat the signal as the primary object and shared context as background noise or convenience. This account inverts that priority: shared history, expertise, culture, and institutional norms are genuine cognitive compression resources, not decoration. The more common ground two interlocutors possess, the lower the minimum rate required to reconstruct a given meaning within acceptable distortion—and the more the conversation can afford to be elliptical, allusive, and fast. We do not merely speak in language; we speak through shared worlds.

Conclusion

The most defensible conclusion is not that rate–distortion theory already provides a full theory of language. It is that human communication may be better understood as a regime than as a rate. Human languages appear to inhabit a context-adaptive regime of collaborative compression, shaped across interactional, developmental, and evolutionary timescales by the need to support low-distortion meaning reconstruction under finite real-time inferential capacity.

This also gives formal content to a long-standing intuition. Robert Anton Wilson's aphorism—offered as provocation rather than theory—anticipated something the framework can now explain:

Communication is only possible between equals.

Robert Anton Wilson, The Illuminatus! Trilogy (1975)

Wilson's aphorism can be sharpened in formal terms: what communication requires is not equality in the abstract, but sufficient symmetry in common ground and in the distortion profiles under which interlocutors operate. Hierarchy degrades communication not simply because speakers are unequal in status, but because they are forced into asymmetric distortion regimes—what can be safely said, inferred, or repaired is no longer shared. In this sense, communicative failure under hierarchy is not merely social noise but a formal asymmetry in shared worlds and distortion tolerance.

The open empirical challenges are threefold: to operationalise the context-sensitive distortion function d_x; to characterise the dynamics of common-ground updating encoded in F; and to test whether the observed cross-linguistic regularities are better explained by this broader regime than by simpler throughput-based accounts alone. A natural first step toward operationalising d_x is available in controlled dialogue experiments: repair initiation rate per unit of compressed reference provides a behavioural proxy for the point at which distortion exceeds tolerance, and its variation across noise, partner familiarity, and task stakes would map the shape of the function in ecologically realistic settings. The novelty of this proposal is not a new empirical discovery but a new conceptual unification that changes how existing findings are interpreted—and, we hope, what experiments are worth running.
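The repair-rate proxy is simple to compute once a transcript is coded. The sketch below assumes hypothetical coding categories (a compressed-reference flag and a repair-initiation flag per turn); the data are invented for illustration, not drawn from any corpus.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    compressed: bool   # speaker used a shortened / elliptical reference
    repair: bool       # listener initiated repair on this turn

# Hypothetical coded transcript (illustrative only).
turns = [
    Turn(True, False), Turn(True, True),  Turn(False, False),
    Turn(True, False), Turn(True, False), Turn(False, True),
    Turn(True, True),  Turn(True, False),
]

# Repair initiations per compressed reference: a behavioural proxy for
# the point at which compression-induced distortion exceeds tolerance.
compressed = [t for t in turns if t.compressed]
proxy = sum(t.repair for t in compressed) / len(compressed)
print(f"repair initiations per compressed reference: {proxy:.2f}")
```

Comparing this quantity across noise levels, partner familiarity, and task stakes would trace how tolerance shifts with context, which is exactly the shape information a candidate d_x must reproduce.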

References

  1. Coupé, C., Oh, Y. M., Dediu, D. & Pellegrino, F. Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche. Sci. Adv. 5, eaaw2594 (2019).
  2. Shannon, C. E. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948).
  3. Shannon, C. E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 4, 142–163 (1959).
  4. Gibson, E., Bergen, L. & Piantadosi, S. T. Rational integration of noisy evidence and prior semantic expectations in sentence interpretation. Proc. Natl Acad. Sci. USA 110, 8051–8056 (2013).
  5. Zaslavsky, N., Hu, J. & Levy, R. P. A rate–distortion view of human pragmatic reasoning. Proc. Soc. Comput. Linguist. 4, 347–348 (2021).
  6. Christiansen, M. H. & Chater, N. The Now-or-Never bottleneck: A fundamental constraint on language. Behav. Brain Sci. 39, e62 (2016).
  7. Clark, H. H. & Wilkes-Gibbs, D. Referring as a collaborative process. Cognition 22, 1–39 (1986).
  8. Juzek, T. S. Signal smoothing and syntactic choices: A critical reflection on the UID hypothesis. Open Mind 8, 217–234 (2024).