
Comparing the Infinite Monkey Theorem (IMT) to LLMs
My prompt to ChatGPT:
I feel like LLMs are a kind of prototype or mimicry of this experiment or theory. With how LLMs are trained, in their billions or trillions of parameters, how do LLMs (in how they work) compare to the "infinite monkey theorem"?
Great — let's dive into the math behind the entropy and sequence likelihood for both the Infinite Monkey Theorem (IMT) and Large Language Models (LLMs). This will help ground the intuitive differences we discussed in formal, measurable terms.
🔢 1. The Infinite Monkey Theorem (IMT) and Sequence Likelihood
❓Setup
Let's assume:
- A monkey types characters uniformly at random from a keyboard of $A$ distinct characters (say, 27: 26 letters + space).
- We're interested in the probability of a specific target sequence $S$ of length $n$.
🧮 Sequence Probability
Each character has a probability $\frac{1}{A}$.
So, the probability of typing a specific sequence $S$ of length $n$ is:
$$
P_{\text{IMT}}(S) = \left(\frac{1}{A}\right)^n
$$
Example:
Let's take "bananas" (7 characters). With 27-character alphabet:
$$
P_{\text{IMT}}(\text{bananas}) = \left(\frac{1}{27}\right)^7 \approx 7.6 \times 10^{-11}
$$
"I chimp, therefore I am" (24 characters, ignoring punctuation):
$$
P_{\text{IMT}} \approx \left(\frac{1}{27}\right)^{24} \approx 6.3 \times 10^{-35}
$$
This probability shrinks exponentially with length, because it's pure uniform randomness.
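A few lines of Python make the exponential decay concrete; the alphabet size and the two example strings are the ones used above, and the script is purely illustrative:

```python
# Probability that a uniformly random typist produces one specific string
# of length n from an alphabet of A characters: (1/A)**n.
A = 27  # 26 letters + space, as above

def imt_probability(target: str) -> float:
    return (1 / A) ** len(target)

print(imt_probability("bananas"))                 # ~9.6e-11
print(imt_probability("I chimp therefore I am"))  # ~3.2e-32 (22 chars, comma dropped)
```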
---
🧠 2. Large Language Models (LLMs) and Sequence Likelihood
LLMs estimate probabilities $P(w_i \mid w_1, \dots, w_{i-1})$ — i.e., next-token prediction.
So for a sequence $S = w_1 w_2 \dots w_n$, the joint probability is:
$$
P_{\text{LLM}}(S) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$
These conditional probabilities are non-uniform, and reflect learned structure: common sequences have much higher probability than rare or ungrammatical ones.
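To make the chain rule concrete, here is a minimal sketch of how one could score a sequence with an off-the-shelf causal language model. It assumes the Hugging Face `transformers` library and the `gpt2` checkpoint purely as examples (neither is mentioned above), and it sums per-token conditional log-probabilities exactly as in the product formula:

```python
# Sketch: total log-probability of a sequence under a causal LM, computed as
# the sum of per-token conditional log-probabilities (chain rule).
# Assumes: pip install torch transformers; "gpt2" is an arbitrary example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("I chimp, therefore I am", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                     # shape: (1, seq_len, vocab)

# Position i predicts token i+1, so align logits[:, :-1] with ids[:, 1:].
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print(token_lp.sum().item())  # log P(w_2 ... w_n | w_1), in nats
```

Exponentiating that sum gives the model's joint probability for everything after the first token, which is the quantity in the product above up to the unconditioned first term.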
🔢 Relative Example:
Let's assume the LLM assigns these rough per-token log probabilities (illustrative, not actual values):
| Word | Log prob $\log_2 P$ (bits) |
| --------- | -------------------- |
| I | -2 |
| chimp | -5 |
| therefore | -4 |
| I (again) | -3 |
| am | -2 |
Then:
$$
\log_2 P_{\text{LLM}}(S) = -2 - 5 - 4 - 3 - 2 = -16
\Rightarrow P_{\text{LLM}}(S) = 2^{-16} \approx 1.5 \times 10^{-5}
$$
Compare this to IMT:
- $P_{\text{IMT}}(S) \approx 3 \times 10^{-32}$
- $P_{\text{LLM}}(S) \approx 1.5 \times 10^{-5}$
👉 That's a gap of roughly 27 orders of magnitude, due to learned structure.
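The arithmetic behind that gap, using the illustrative table values:

```python
import math

# Illustrative per-token log2 probabilities from the table above (not real model outputs).
token_log2_probs = [-2, -5, -4, -3, -2]   # I, chimp, therefore, I, am

llm_prob = 2 ** sum(token_log2_probs)     # 2**-16 ≈ 1.5e-5
imt_prob = (1 / 27) ** 22                 # uniform typing of the 22-character string

print(f"LLM: {llm_prob:.1e}   IMT: {imt_prob:.1e}")
print(f"Gap: ~{math.log10(llm_prob / imt_prob):.0f} orders of magnitude")  # ~27
```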
---
📉 3. Entropy Perspective
❗Entropy (per symbol):
- Entropy $H$ measures the average surprise (in bits) of symbols in a distribution.
$$
H = -\sum_{i} P(x_i) \log_2 P(x_i)
$$
For IMT:
- Uniform over 27 characters:
$$
H_{\text{IMT}} = \log_2 27 \approx 4.75 \text{ bits/char}
$$
- So a 100-char string has ~475 bits of entropy — all randomness.
For LLM:
- Because of prediction and learned structure, entropy is much lower: reported figures for LLMs on English text are typically on the order of ~1.5 to 2.5 bits per character, with the strongest models lower still.
So:
- LLM: a 100-char sentence ≈ 150–250 bits.
- IMT: a 100-char sentence ≈ 475 bits.
Interpretation:
- Lower entropy means less surprise: LLMs compress and reuse structure.
- IMT entropy is maximal: every character is a coin flip, no structure.
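The per-character numbers above follow directly from the definition; here is a short check, where the 2 bits/char figure for the LLM is an illustrative assumption rather than a measured value:

```python
import math

A = 27
h_uniform = math.log2(A)  # maximal entropy: ~4.75 bits per character

# Same value written out as the sum in the definition above.
h_sum = -sum((1 / A) * math.log2(1 / A) for _ in range(A))
assert abs(h_uniform - h_sum) < 1e-12

h_llm = 2.0  # assumed bits/char for an LLM on English text (illustrative)

n = 100
print(f"IMT: {h_uniform:.2f} bits/char -> ~{n * h_uniform:.0f} bits per {n}-char string")
print(f"LLM: {h_llm:.2f} bits/char -> ~{n * h_llm:.0f} bits per {n}-char string")
```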
---
📚 4. Compression & Kolmogorov Complexity
There's a related idea: the Kolmogorov complexity of a string — the length of the shortest program that can generate it.
- Shakespeare's Hamlet has high surface complexity but far lower algorithmic complexity than a random string of the same length — it's full of patterns.
- LLMs effectively learn a compressed representation of those patterns.
- IMT doesn't compress at all — it "searches" exhaustively.
So, in terms of compressibility:
| Generator | Compression | Complexity |
| --------- | ----------- | -------------------------- |
| IMT | None | Max (random) |
| LLM | High | Lower (captures structure) |
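A crude but concrete way to see the gap is to run an off-the-shelf compressor over structured English versus uniformly random characters from the same 27-symbol alphabet; `zlib` here is just a convenient stand-in for a pattern-exploiting model, not something the argument above depends on:

```python
import random
import string
import zlib

random.seed(0)
alphabet = string.ascii_lowercase + " "  # the 27-character alphabet

# Highly structured text (repeated Hamlet lines) vs. uniform random characters.
english = ("to be or not to be that is the question "
           "whether tis nobler in the mind to suffer ") * 25
noise = "".join(random.choice(alphabet) for _ in range(len(english)))

for label, text in [("English", english), ("Random", noise)]:
    ratio = len(zlib.compress(text.encode())) / len(text)
    print(f"{label}: compressed to {ratio:.0%} of original size")
```

The random stream only shrinks about as far as its per-character entropy allows (roughly 4.75 of 8 bits per byte), while the structured text collapses because the compressor finds and reuses its patterns.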
---
🧮 Final Takeaway Table
| Metric | Infinite Monkey Theorem | Large Language Models |
| -------------------- | ----------------------- | ------------------------------- |
| Sequence Probability | $(1/A)^n$ | Product of learned conditionals |
| Entropy (per char) | $\log_2 A \approx 4.75$ | ~1.5–2.5 bits (for English) |
| Language modeling | None | Neural net trained on corpora |
| Compression | None (full entropy) | Learns compressed patterns |
| Emergent meaning | Impossible | Often present |
---
🧠 So What Does This Mean?
LLMs are not monkeys with fast fingers. They're more like statistical language compressors: they infer structure and use it to assign high probability to plausible sequences. The IMT is brute-force search in an exponentially vast space with no shortcuts.
The monkey has no memory. The LLM is memory—refined into a predictive engine.