Comparing the Infinite Monkey Theorem (IMT) to LLMs
ChatGPT & Benji Asperheim · Mon, Jun 16th, 2025


My prompt to ChatGPT:

I feel like LLMs are a kind of prototype or mimicry of this experiment or theory. With how LLMs are trained, in their billions or trillions of parameters, how do LLMs (in how they work) compare to the “infinite monkey theorem”?

Great — let’s dive into the math behind the entropy and sequence likelihood for both the Infinite Monkey Theorem (IMT) and Large Language Models (LLMs). This will help ground the intuitive differences we discussed in formal, measurable terms.

🔢 1. The Infinite Monkey Theorem (IMT) and Sequence Likelihood

❓Setup

Let’s assume:

- A monkey types one character at a time, drawn uniformly at random from an alphabet of $A$ symbols (here $A = 27$: the 26 lowercase letters plus the space).
- Every keystroke is independent of all previous keystrokes.

🧮 Sequence Probability

Each character has probability $\frac{1}{A}$. So, the probability of typing a specific sequence $S$ of length $n$ is:

$$P_{\text{IMT}}(S) = \left(\frac{1}{A}\right)^n$$

Example:

Let’s take “bananas” (7 characters). With the 27-character alphabet:

$$P_{\text{IMT}}(\text{bananas}) = \left(\frac{1}{27}\right)^7 \approx 9.6 \times 10^{-11}$$

“I chimp, therefore I am” (22 characters, ignoring punctuation):

$$P_{\text{IMT}} \approx \left(\frac{1}{27}\right)^{22} \approx 3.2 \times 10^{-32}$$

This probability shrinks exponentially with length, because it’s pure uniform randomness.
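
A quick way to sanity-check these numbers is to evaluate $(1/A)^n$ directly. Here is a minimal Python sketch, assuming the same 27-character alphabet (26 letters plus space) and lowercased phrases:

```python
from math import log10

def imt_probability(text: str, alphabet_size: int = 27) -> float:
    """Probability that uniform random typing produces `text` exactly,
    with each character drawn independently from `alphabet_size` keys."""
    return (1 / alphabet_size) ** len(text)

for phrase in ["bananas", "i chimp therefore i am"]:
    p = imt_probability(phrase)
    print(f"{phrase!r}: n = {len(phrase)}, P = {p:.2e} (10^{log10(p):.1f})")
```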


🧠 2. Large Language Models (LLMs) and Sequence Likelihood

LLMs estimate conditional probabilities $P(w_i \mid w_1, \dots, w_{i-1})$, i.e., next-token prediction.

So for a sequence $S = w_1 w_2 \dots w_n$, the joint probability is:

$$P_{\text{LLM}}(S) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$$

These conditional probabilities are non-uniform and reflect learned structure: common sequences have much higher probability than rare or ungrammatical ones.
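
The chain rule itself is easy to sketch. The per-token conditional probabilities below are invented for illustration (a real model would compute them from its logits over the whole vocabulary); the point is only that the joint probability is a product of context-dependent terms rather than a flat $(1/A)^n$:

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1..w_{i-1}), in sequence order.
# Chosen as exact powers of two so they match the illustrative log-prob table below.
conditionals = [
    ("I",         0.25),      # 2^-2
    ("chimp",     0.03125),   # 2^-5
    ("therefore", 0.0625),    # 2^-4
    ("I",         0.125),     # 2^-3
    ("am",        0.25),      # 2^-2
]

joint = math.prod(p for _, p in conditionals)             # product of conditionals
log2_joint = sum(math.log2(p) for _, p in conditionals)   # same computation in log space

print(f"P_LLM(S) = {joint:.3e}  (log2 = {log2_joint:.0f})")  # ~1.5e-5, log2 = -16
```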

🔢 Relative Example:

Let’s assume the LLM assigns these rough log probabilities (illustrative values, not actual model outputs):

| Word | Log prob $\log_2(P)$ |
|---|---|
| I | -2 |
| chimp | -5 |
| therefore | -4 |
| I (again) | -3 |
| am | -2 |

Then:

$$\log_2 P_{\text{LLM}}(S) = -2 - 5 - 4 - 3 - 2 = -16 \quad\Rightarrow\quad P_{\text{LLM}}(S) = 2^{-16} \approx 1.5 \times 10^{-5}$$

Compare this to IMT for the same phrase:

$$P_{\text{IMT}} \approx 3.2 \times 10^{-32}$$

👉 That’s a difference of roughly 27 orders of magnitude, due to learned structure.
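
The same comparison in a few lines of Python (using the illustrative figures above, so the exact gap is only indicative):

```python
from math import log10

p_llm = 2 ** -16          # from the illustrative log-prob table above
p_imt = (1 / 27) ** 22    # uniform typing of "i chimp therefore i am"

print(f"P_LLM ~ {p_llm:.1e}, P_IMT ~ {p_imt:.1e}")
print(f"gap: about {log10(p_llm / p_imt):.0f} orders of magnitude")
```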


📉 3. Entropy Perspective

❗Entropy (per symbol):

$$H = -\sum_{i} P(x_i) \log_2 P(x_i)$$

For IMT:

$$H_{\text{IMT}} = \log_2 27 \approx 4.75 \text{ bits/char}$$

For an LLM (empirically, for English text, per the table below):

$$H_{\text{LLM}} \approx 1.5 \text{ to } 2.5 \text{ bits/char}$$

So:

$$H_{\text{LLM}} \ll H_{\text{IMT}} \approx 4.75 \text{ bits/char}$$

Interpretation: the LLM concentrates probability mass on plausible continuations rather than spreading it uniformly over the alphabet, so each symbol carries roughly a third to a half of the uncertainty it has under uniform monkey typing.
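
As a sanity check on the per-symbol entropy formula, the sketch below computes $H$ for the uniform 27-key distribution and for a hypothetical skewed distribution. The skewed numbers are made up purely to show how concentrating probability lowers entropy; they are not measured statistics of English:

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2 p), in bits per symbol."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform_27 = [1 / 27] * 27                     # the monkey's keyboard
skewed = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]      # hypothetical concentrated distribution

print(f"uniform over 27 symbols: {entropy_bits(uniform_27):.2f} bits/symbol")  # ~4.75
print(f"skewed toy distribution: {entropy_bits(skewed):.2f} bits/symbol")      # ~2.06
```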


📚 4. Compression & Kolmogorov Complexity

There’s a related idea: the Kolmogorov complexity of a string — the length of the shortest program that can generate it.

So, in terms of compressibility:

| Generator | Compression | Complexity |
|---|---|---|
| IMT | None | Max (random) |
| LLM | High | Lower (captures structure) |
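
Kolmogorov complexity is uncomputable in general, but an off-the-shelf compressor gives a rough practical proxy. A minimal sketch, comparing uniformly random “monkey” text with deliberately repetitive English-like text (both strings are generated here, not drawn from any corpus):

```python
import random
import string
import zlib

random.seed(0)
alphabet = string.ascii_lowercase + " "

# "Monkey" text: 1000 characters drawn uniformly at random from 27 symbols.
monkey = "".join(random.choice(alphabet) for _ in range(1000))

# Structured text: a short sentence repeated to the same length (extreme structure).
structured = ("i chimp therefore i am " * 50)[:1000]

for name, text in [("monkey", monkey), ("structured", structured)]:
    compressed_len = len(zlib.compress(text.encode()))
    print(f"{name}: {len(text)} chars -> {compressed_len} bytes after zlib")
```

The random string compresses only modestly (it is already close to its ~4.75 bits/char entropy limit), while the repetitive one collapses to a tiny fraction of its size, which is the table above in miniature.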

🧮 Final Takeaway Table

| Metric | Infinite Monkey Theorem | Large Language Models |
|---|---|---|
| Sequence probability | $(1/A)^n$ | Product of learned conditionals |
| Entropy (per char) | $\log_2 A \approx 4.75$ bits | ~1.5 to 2.5 bits (for English) |
| Language modeling | None | Neural net trained on corpora |
| Compression | None (full entropy) | Learns compressed patterns |
| Emergent meaning | Impossible | Often present |

🧠 So What Does This Mean?

LLMs are not monkeys with fast fingers. They’re more like statistical language compressors: they infer structure and use it to maximize the likelihood of plausible sequences. The IMT, by contrast, is brute-force search in an exponentially vast space with no shortcuts.

The monkey has no memory. The LLM is memory—refined into a predictive engine.

