
Comparing the Infinite Monkey Theorem (IMT) to LLMs
My prompt to ChatGPT:
I feel like LLMs are a kind of prototype or mimicry of this experiment or theory. With how LLMs are trained, in their billions or trillions of parameters, how do LLMs (in how they work) compare to the "infinite monkey theorem"?
Great — let's dive into the math behind the entropy and sequence likelihood for both the Infinite Monkey Theorem (IMT) and Large Language Models (LLMs). This will help ground the intuitive differences we discussed in formal, measurable terms.
🔢 1. The Infinite Monkey Theorem (IMT) and Sequence Likelihood
❓Setup
Let's assume:
- A monkey types characters uniformly at random from a keyboard of $A$ distinct characters (say, 27: 26 letters + space).
- We're interested in the probability of a specific target sequence $S$ of length $n$.
🧮 Sequence Probability
Each character has a probability $\frac{1}{A}$.
So, the probability of typing a specific sequence $S$ of length $n$ is:
$$
P_{\text{IMT}}(S) = \left(\frac{1}{A}\right)^n
$$
Example:
Let's take "bananas" (7 characters). With 27-character alphabet:
$$
P_{\text{IMT}}(\text{bananas}) = \left(\frac{1}{27}\right)^7 \approx 7.6 \times 10^{-11}
$$
"I chimp, therefore I am" (24 characters, ignoring punctuation):
$$
P_{\text{IMT}} \approx \left(\frac{1}{27}\right)^{24} \approx 6.3 \times 10^{-35}
$$
This probability shrinks exponentially with length, because it's pure uniform randomness.
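A few lines of Python make the exponential decay concrete; the alphabet size and the two example strings are the ones used above, and the script is purely illustrative:

```python
# Probability that a uniformly random typist produces one specific string
# of length n from an alphabet of A characters: (1/A)**n.
A = 27  # 26 letters + space, as above

def imt_probability(target: str) -> float:
    return (1 / A) ** len(target)

print(imt_probability("bananas"))                 # ~9.6e-11
print(imt_probability("I chimp therefore I am"))  # ~3.2e-32 (22 chars, comma dropped)
```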
---
🧠 2. Large Language Models (LLMs) and Sequence Likelihood
LLMs estimate probabilities $P(w_i \mid w_1, \dots, w_{i-1})$ — i.e., next-token prediction.
So for a sequence $S = w_1 w_2 \dots w_n$, the joint probability is:
$$
P_{\text{LLM}}(S) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
$$
These conditional probabilities are non-uniform, and reflect learned structure: common sequences have much higher probability than rare or ungrammatical ones.
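To make the chain rule concrete, here is a minimal sketch of how one could score a sequence with an off-the-shelf causal language model. It assumes the Hugging Face `transformers` library and the `gpt2` checkpoint purely as examples (neither is mentioned above), and it sums per-token conditional log-probabilities exactly as in the product formula:

```python
# Sketch: total log-probability of a sequence under a causal LM, computed as
# the sum of per-token conditional log-probabilities (chain rule).
# Assumes: pip install torch transformers; "gpt2" is an arbitrary example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tokenizer("I chimp, therefore I am", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                     # shape: (1, seq_len, vocab)

# Position i predicts token i+1, so align logits[:, :-1] with ids[:, 1:].
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
token_lp = log_probs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print(token_lp.sum().item())  # log P(w_2 ... w_n | w_1), in nats
```

Exponentiating that sum gives the model's joint probability for everything after the first token, which is the quantity in the product above up to the unconditioned first term.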
🔢 Relative Example:
Let's assume the LLM assigns these rough per-token log probabilities (illustrative, not actual values):
| Word | Log prob $\log_2 P$ (bits) |
| --------- | -------------------- |
| I | -2 |
| chimp | -5 |
| therefore | -4 |
| I (again) | -3 |
| am | -2 |
Then:
$$
\log_2 P_{\text{LLM}}(S) = -2 - 5 - 4 - 3 - 2 = -16
\Rightarrow P_{\text{LLM}}(S) = 2^{-16} \approx 1.5 \times 10^{-5}
$$
Compare this to IMT:
- $P_{\text{IMT}}(S) \approx 3 \times 10^{-32}$
- $P_{\text{LLM}}(S) \approx 1.5 \times 10^{-5}$
👉 That's a gap of roughly 27 orders of magnitude, due to learned structure.
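The arithmetic behind that gap, using the illustrative table values:

```python
import math

# Illustrative per-token log2 probabilities from the table above (not real model outputs).
token_log2_probs = [-2, -5, -4, -3, -2]   # I, chimp, therefore, I, am

llm_prob = 2 ** sum(token_log2_probs)     # 2**-16 ≈ 1.5e-5
imt_prob = (1 / 27) ** 22                 # uniform typing of the 22-character string

print(f"LLM: {llm_prob:.1e}   IMT: {imt_prob:.1e}")
print(f"Gap: ~{math.log10(llm_prob / imt_prob):.0f} orders of magnitude")  # ~27
```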
---
📉 3. Entropy Perspective
❗Entropy (per symbol):
- Entropy $H$ measures the average surprise (in bits) of symbols in a distribution.
$$
H = -\sum_{i} P(x_i) \log_2 P(x_i)
$$
For IMT:
- Uniform over 27 characters:
$$
H_{\text{IMT}} = \log_2 27 \approx 4.75 \text{ bits/char}
$$
- So a 100-char string has ~475 bits of entropy — all randomness.
For LLM:
- Because of prediction and learned structure, entropy is much lower: reported figures for LLMs on English text are typically on the order of ~1.5 to 2.5 bits per character, with the strongest models lower still.
So:
- LLM: a 100-char sentence ≈ 150–250 bits.
- IMT: a 100-char sentence ≈ 475 bits.
Interpretation:
- Lower entropy means less surprise: LLMs compress and reuse structure.
- IMT entropy is maximal: every character is a coin flip, no structure.
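The per-character numbers above follow directly from the definition; here is a short check, where the 2 bits/char figure for the LLM is an illustrative assumption rather than a measured value:

```python
import math

A = 27
h_uniform = math.log2(A)  # maximal entropy: ~4.75 bits per character

# Same value written out as the sum in the definition above.
h_sum = -sum((1 / A) * math.log2(1 / A) for _ in range(A))
assert abs(h_uniform - h_sum) < 1e-12

h_llm = 2.0  # assumed bits/char for an LLM on English text (illustrative)

n = 100
print(f"IMT: {h_uniform:.2f} bits/char -> ~{n * h_uniform:.0f} bits per {n}-char string")
print(f"LLM: {h_llm:.2f} bits/char -> ~{n * h_llm:.0f} bits per {n}-char string")
```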
---
📚 4. Compression & Kolmogorov Complexity
There's a related idea: the Kolmogorov complexity of a string — the length of the shortest program that can generate it.
- Shakespeare's Hamlet has high surface complexity but far lower algorithmic complexity than a random string of the same length — it's full of patterns.
- LLMs effectively learn a compressed representation of those patterns.
- IMT doesn't compress at all — it "searches" exhaustively.
So, in terms of compressibility:
| Generator | Compression | Complexity |
| --------- | ----------- | -------------------------- |
| IMT | None | Max (random) |
| LLM | High | Lower (captures structure) |
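A crude but concrete way to see the gap is to run an off-the-shelf compressor over structured English versus uniformly random characters from the same 27-symbol alphabet; `zlib` here is just a convenient stand-in for a pattern-exploiting model, not something the argument above depends on:

```python
import random
import string
import zlib

random.seed(0)
alphabet = string.ascii_lowercase + " "  # the 27-character alphabet

# Highly structured text (repeated Hamlet lines) vs. uniform random characters.
english = ("to be or not to be that is the question "
           "whether tis nobler in the mind to suffer ") * 25
noise = "".join(random.choice(alphabet) for _ in range(len(english)))

for label, text in [("English", english), ("Random", noise)]:
    ratio = len(zlib.compress(text.encode())) / len(text)
    print(f"{label}: compressed to {ratio:.0%} of original size")
```

The random stream only shrinks about as far as its per-character entropy allows (roughly 4.75 of 8 bits per byte), while the structured text collapses because the compressor finds and reuses its patterns.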
---
🧮 Final Takeaway Table
| Metric | Infinite Monkey Theorem | Large Language Models |
| -------------------- | ----------------------- | ------------------------------- |
| Sequence Probability | $(1/A)^n$ | Product of learned conditionals |
| Entropy (per char) | $\log_2 A \approx 4.75$ | ~1.5–2.5 bits (for English) |
| Language modeling | None | Neural net trained on corpora |
| Compression | None (full entropy) | Learns compressed patterns |
| Emergent meaning | Impossible | Often present |
---
🧠 So What Does This Mean?
LLMs are not monkeys with fast fingers. They're more like statistical language compressors: they infer structure and use it to assign high probability to plausible sequences. The IMT is brute-force search in an exponentially vast space with no shortcuts.
The monkey has no memory. The LLM is memory—refined into a predictive engine.