Comparing the Infinite Monkey Theorem (IMT) to LLMs
ChatGPT & Benji Asperheim · Mon, Jun 16th, 2025


My prompt to ChatGPT:

I feel like LLMs are a kind of prototype or mimicry of this experiment or theory. With how LLMs are trained, in their billions or trillions of parameters, how do LLMs (in how they work) compare to the “infinite monkey theorem”?

Great — let’s dive into the math behind the entropy and sequence likelihood for both the Infinite Monkey Theorem (IMT) and Large Language Models (LLMs). This will help ground the intuitive differences we discussed in formal, measurable terms.

🔢 1. The Infinite Monkey Theorem (IMT) and Sequence Likelihood

❓Setup

Let’s assume:

- A monkey types one character at a time, drawn uniformly at random from an alphabet of $A$ symbols (here $A = 27$: the 26 lowercase letters plus the space).
- Every keystroke is independent of all previous keystrokes.

🧮 Sequence Probability

Each character has probability $\frac{1}{A}$. So, the probability of typing a specific sequence $S$ of length $n$ is:

$$P_{\text{IMT}}(S) = \left(\frac{1}{A}\right)^n$$

Example:

Let’s take “bananas” (7 characters). With the 27-character alphabet:

$$P_{\text{IMT}}(\text{bananas}) = \left(\frac{1}{27}\right)^7 \approx 9.6 \times 10^{-11}$$

“I chimp, therefore I am” (22 characters, ignoring punctuation):

$$P_{\text{IMT}} \approx \left(\frac{1}{27}\right)^{22} \approx 3.2 \times 10^{-32}$$

This probability shrinks exponentially with length, because it’s pure uniform randomness.
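
A quick way to sanity-check these numbers is to evaluate $(1/A)^n$ directly. Here is a minimal Python sketch, assuming the same 27-character alphabet (26 letters plus space) and lowercased phrases:

```python
from math import log10

def imt_probability(text: str, alphabet_size: int = 27) -> float:
    """Probability that uniform random typing produces `text` exactly,
    with each character drawn independently from `alphabet_size` keys."""
    return (1 / alphabet_size) ** len(text)

for phrase in ["bananas", "i chimp therefore i am"]:
    p = imt_probability(phrase)
    print(f"{phrase!r}: n = {len(phrase)}, P = {p:.2e} (10^{log10(p):.1f})")
```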


🧠 2. Large Language Models (LLMs) and Sequence Likelihood

LLMs estimate conditional probabilities $P(w_i \mid w_1, \dots, w_{i-1})$, i.e., next-token prediction.

So for a sequence $S = w_1 w_2 \dots w_n$, the joint probability is:

$$P_{\text{LLM}}(S) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

These conditional probabilities are non-uniform and reflect learned structure: common sequences have much higher probability than rare or ungrammatical ones.
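
The chain rule itself is easy to sketch. The per-token conditional probabilities below are invented for illustration (a real model would compute them from its logits over the whole vocabulary); the point is only that the joint probability is a product of context-dependent terms rather than a flat $(1/A)^n$:

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1..w_{i-1}), in sequence order.
# Chosen as exact powers of two so they match the illustrative log-prob table below.
conditionals = [
    ("I",         0.25),      # 2^-2
    ("chimp",     0.03125),   # 2^-5
    ("therefore", 0.0625),    # 2^-4
    ("I",         0.125),     # 2^-3
    ("am",        0.25),      # 2^-2
]

joint = math.prod(p for _, p in conditionals)             # product of conditionals
log2_joint = sum(math.log2(p) for _, p in conditionals)   # same computation in log space

print(f"P_LLM(S) = {joint:.3e}  (log2 = {log2_joint:.0f})")  # ~1.5e-5, log2 = -16
```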

🔢 Relative Example:

Let’s assume the LLM assigns these rough log probabilities (illustrative values, not actual model outputs):

| Word | Log prob $\log_2(P)$ |
|---|---|
| I | -2 |
| chimp | -5 |
| therefore | -4 |
| I (again) | -3 |
| am | -2 |

Then:

$$\log_2 P_{\text{LLM}}(S) = -2 - 5 - 4 - 3 - 2 = -16 \quad\Rightarrow\quad P_{\text{LLM}}(S) = 2^{-16} \approx 1.5 \times 10^{-5}$$

Compare this to IMT for the same phrase:

$$P_{\text{IMT}} \approx 3.2 \times 10^{-32}$$

👉 That’s a difference of roughly 27 orders of magnitude, due to learned structure.
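
The same comparison in a few lines of Python (using the illustrative figures above, so the exact gap is only indicative):

```python
from math import log10

p_llm = 2 ** -16          # from the illustrative log-prob table above
p_imt = (1 / 27) ** 22    # uniform typing of "i chimp therefore i am"

print(f"P_LLM ~ {p_llm:.1e}, P_IMT ~ {p_imt:.1e}")
print(f"gap: about {log10(p_llm / p_imt):.0f} orders of magnitude")
```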


📉 3. Entropy Perspective

❗Entropy (per symbol):

$$H = -\sum_{i} P(x_i) \log_2 P(x_i)$$

For IMT:

$$H_{\text{IMT}} = \log_2 27 \approx 4.75 \text{ bits/char}$$

For an LLM (empirically, for English text, per the table below):

$$H_{\text{LLM}} \approx 1.5 \text{ to } 2.5 \text{ bits/char}$$

So:

$$H_{\text{LLM}} \ll H_{\text{IMT}} \approx 4.75 \text{ bits/char}$$

Interpretation: the LLM concentrates probability mass on plausible continuations rather than spreading it uniformly over the alphabet, so each symbol carries roughly a third to a half of the uncertainty it has under uniform monkey typing.
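
As a sanity check on the per-symbol entropy formula, the sketch below computes $H$ for the uniform 27-key distribution and for a hypothetical skewed distribution. The skewed numbers are made up purely to show how concentrating probability lowers entropy; they are not measured statistics of English:

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2 p), in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform_27 = [1 / 27] * 27                     # the monkey's keyboard
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]      # hypothetical concentrated distribution

print(f"uniform over 27 symbols: {entropy_bits(uniform_27):.2f} bits/symbol")  # ~4.75
print(f"skewed toy distribution: {entropy_bits(skewed):.2f} bits/symbol")      # ~2.06
```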


📚 4. Compression & Kolmogorov Complexity

There’s a related idea: the Kolmogorov complexity of a string — the length of the shortest program that can generate it.

So, in terms of compressibility:

| Generator | Compression | Complexity |
|---|---|---|
| IMT | None | Max (random) |
| LLM | High | Lower (captures structure) |
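
Kolmogorov complexity is uncomputable in general, but an off-the-shelf compressor gives a rough practical proxy. A minimal sketch, comparing uniformly random “monkey” text with deliberately repetitive English-like text (both strings are generated here, not drawn from any corpus):

```python
import random
import string
import zlib

random.seed(0)
alphabet = string.ascii_lowercase + " "

# "Monkey" text: 1000 characters drawn uniformly at random from 27 symbols.
monkey = "".join(random.choice(alphabet) for _ in range(1000))

# Structured text: a short sentence repeated to the same length (extreme structure).
structured = ("i chimp therefore i am " * 50)[:1000]

for name, text in [("monkey", monkey), ("structured", structured)]:
    compressed_len = len(zlib.compress(text.encode()))
    print(f"{name}: {len(text)} chars -> {compressed_len} bytes after zlib")
```

The random string compresses only modestly (it is already close to its ~4.75 bits/char entropy limit), while the repetitive one collapses to a tiny fraction of its size, which is the table above in miniature.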

🧮 Final Takeaway Table

| Metric | Infinite Monkey Theorem | Large Language Models |
|---|---|---|
| Sequence probability | $(1/A)^n$ | Product of learned conditionals |
| Entropy (per char) | $\log_2 A \approx 4.75$ bits | ~1.5 to 2.5 bits (for English) |
| Language modeling | None | Neural net trained on corpora |
| Compression | None (full entropy) | Learns compressed patterns |
| Emergent meaning | Impossible | Often present |

🧠 So What Does This Mean?

LLMs are not monkeys with fast fingers. They’re more like statistical language compressors: they infer structure and use it to maximize the likelihood of plausible sequences. The IMT, by contrast, is brute-force search in an exponentially vast space with no shortcuts.

The monkey has no memory. The LLM is memory—refined into a predictive engine.

