# Random Words

*Probability and English ... what a mix!*

## Random Letters

You would think it was easy to create random words ... just pick letters randomly and put them together, and voila! a random word.

Well, here are 20 words made that way:

tldkl oewkx dmwol vuptg hvwjk naqid avypr zwtip zgnzs bvdhd muyfd ighgd xhlng oyecn vjnsl ssjrx gxald tukxj rvfoq yxzxq

It turns out that the words are not only nonsense, but quite hard to pronounce!

(Try saying "**tldkl**" or "**oewkx**")
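Here is a minimal Python sketch of the "pick letters randomly" idea, assuming every letter is equally likely. The name `random_word` is made up for this illustration and is not the code behind the words above.

```python
import random
import string

def random_word(length=5, rng=random):
    """Return `length` lowercase letters, each picked with equal probability."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

# Make 20 five-letter "words", like the list above (the output differs every run).
words = [random_word() for _ in range(20)]
print(" ".join(words))
```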

You see, making a real word this way is *very unlikely* ... you would have to try lots of random combinations before getting lucky.

Why? Well, English has around 200,000 words *(228,000 in the Oxford English Dictionary, including many words no longer used)* ... but how many different words can be made with just 5 letters?

26 × 26 × 26 × 26 × 26 = **11,881,376** possible 5 letter words!

And that is just the 5 letter words ...

Let us guess that there are 40,000 words in English that have 5 letters. So the probability of making a real word just **randomly** would be:

40,000 / **11,881,376** = 0.003, or about 0.3% chance

So *real words are rare*. And we can see that putting random letters together is very unlikely to produce a real word.
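The arithmetic can be checked in a few lines of Python. The 40,000 figure is the guess used in the text, not a measured count, and the variable names are just for this sketch.

```python
ALPHABET_SIZE = 26
WORD_LENGTH = 5
REAL_WORDS_GUESS = 40_000  # the guess from the text, not a measured count

# Every position can hold any of the 26 letters.
possible = ALPHABET_SIZE ** WORD_LENGTH
chance = REAL_WORDS_GUESS / possible

print(f"{possible:,} possible 5 letter words")  # 11,881,376 possible 5 letter words
print(f"chance of a real word: {chance:.2%}")   # chance of a real word: 0.34%
```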

## Vowels

We can improve our success by insisting that a word have at least one vowel, since nearly every word in English has one (except fly, by and a few others). Like this:

ectot gjaqv kuifg vzicu zspsu pdidb wqdis uerrs ucgej okimw fnevz ewxko ljgew aglgo jpfoq dcytu uwkcj dzioy wekdx xuybk

This is a great improvement. More words can be pronounced.

But there are still lots of strange words like "**zspsu**" and "**xuybk**"
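One simple way to insist on a vowel is to keep making random words and throw away any that have none. That "try again" approach is an assumption made for this sketch (the text does not say how the words above were made), and `random_word_with_vowel` is a made-up name.

```python
import random
import string

VOWELS = set("aeiou")

def random_word_with_vowel(length=5, rng=random):
    """Draw uniform random words until one contains at least one vowel."""
    while True:
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if VOWELS & set(word):  # non-empty overlap means the word has a vowel
            return word

words = [random_word_with_vowel() for _ in range(20)]
print(" ".join(words))
```

Only about a third of uniform random 5 letter words have no vowel at all (21/26 multiplied by itself 5 times), so the loop rarely needs more than a couple of tries.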

## Letter Frequency

So, our next improvement is to use *fewer* of the letters like j, x, z and q and *more* of the letters like e, t and s.

In fact the **frequency of letters** in the English language is well known. Here is how many times you would *expect* to see each letter in every 1,000 letters:

a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
82 | 15 | 28 | 42 | 127 | 22 | 20 | 61 | 70 | 2 | 8 | 40 | 24 | 67 | 75 | 19 | 1 | 60 | 63 | 90 | 27 | 10 | 24 | 2 | 20 | 1 |

Can you see that "e" is common, but "z" is rare?

- "e" is likely to occur 127 times in every 1,000, or as a ratio 127/1000 = 0.127 (= 12.7%)
- "z" is likely to occur only 1 time in every 1,000, or as a ratio 1/1000 = 0.001 (= 0.1%)

So, by selecting letters based on that frequency (a bit like rolling a 1,000-sided die that has 82 faces marked **a**, 15 marked **b** ... and only one marked **z**), we can get output like this:

elnao etgov segty laast aessn siuon oenha eaoas ncoot ctwka dmswo dpuoh eewis ebdni laarm syucs idvos lhina igahh soyie

Still no real words, but some are close. And most of them can be pronounced. (Great names if you are writing a science fiction novel!)
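The "1,000 sided die" can be sketched with Python's `random.choices`, which picks items according to weights. The weights below are copied from the frequency table above; the names `PER_THOUSAND` and `weighted_word` are made up for this example.

```python
import random
import string

# Expected occurrences per 1,000 letters, in order a..z (from the table above).
PER_THOUSAND = [82, 15, 28, 42, 127, 22, 20, 61, 70, 2, 8, 40, 24,
                67, 75, 19, 1, 60, 63, 90, 27, 10, 24, 2, 20, 1]

def weighted_word(length=5, rng=random):
    """Pick each letter with probability proportional to its table entry."""
    letters = rng.choices(string.ascii_lowercase, weights=PER_THOUSAND, k=length)
    return "".join(letters)

words = [weighted_word() for _ in range(20)]
print(" ".join(words))
```

The weights do not need to add up to 1,000 for `random.choices` to work; it only looks at their relative sizes.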

## Try For Yourself!

You can try all three methods here ... see if you can get lucky and find a real word:

### but we can do better ...

## 2-Letter Frequencies

We can take the idea of Letter Frequency one step further by asking

**"what is the frequency of letters that follow another letter"**

For example, if we already have a "t", the next letter is **very likely** to be an "h" (making "th").

To illustrate this, I built up a Table of Two-Letter Frequencies (from *Alice's Adventures in Wonderland*). Here is the line for "t":

Freq | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

t | 238 | 41 | 727 | 11 | 3197 | 459 | 275 | 18 | 12 | 990 | 149 | 153 | 333 | 125 | 65 | 54 |

So, "h" occurred 3197 times after a "t" ("th") ... but "b" **never** followed a "t".
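A table like this can be built by walking through a text and counting each pair of neighbouring letters. The sketch below uses a short stand-in sentence rather than *Alice's Adventures in Wonderland*, so the counts are tiny, and `pair_counts` is a made-up name.

```python
from collections import Counter, defaultdict

def pair_counts(text):
    """Map each letter to a Counter of the letters that directly follow it."""
    table = defaultdict(Counter)
    text = text.lower()
    for first, second in zip(text, text[1:]):
        # Only count letter-to-letter pairs, like the a..z table above.
        if first.isalpha() and second.isalpha():
            table[first][second] += 1
    return table

sample = "the cat sat on the mat with that hat"  # stand-in text, an assumption
table = pair_counts(sample)
print(table["t"].most_common(3))  # [('h', 4)]
print(table["t"]["b"])            # 0, because "b" never follows "t" here
```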

OK, let us start with a "t". Say we choose an "h" to make "th". Next we use the "h" row to choose another letter (maybe an "e" to make "the"), and so on. Here is a sample:

the cur the bund hof arytowno d sheromasees asemedosouro f soacthake d imon binofowat oaten d heng wa

The results are remarkable ... nonsense, but almost like some strange language.

In fact we are not just making random words now, we are making random **sentences**!
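Here is a rough sketch of that row-by-row walk. It assumes the space counts as a character too (which is how "sentences" with word breaks can appear), and it uses a short stand-in text instead of the book. The names `build_table` and `generate` are made up for this illustration.

```python
import random
from collections import Counter, defaultdict

def build_table(text):
    """Count, for every character (spaces included), what follows it."""
    table = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        table[first][second] += 1
    return table

def generate(table, start, length, rng=random):
    """Each new character is drawn from the row of the previous character."""
    out = [start]
    while len(out) < length:
        row = table.get(out[-1])
        if not row:  # dead end: nothing ever followed this character
            break
        choices = list(row)
        weights = [row[c] for c in choices]
        out.append(rng.choices(choices, weights=weights)[0])
    return "".join(out)

# Stand-in source text (an assumption); a longer text gives better results.
source = "the cat sat on the mat and then the other cat sat there too "
table = build_table(source)
print(generate(table, "t", 40))
```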

## Higher Letter Frequencies

Why stop there? We can make tables of three letter frequencies or more ...

### 3 Letter Frequencies

How do 3 Letter Frequencies work?

Well, say I already have two letters (like "ei") ... we then:

- look through the sample text for every time "ei" appears,
- randomly choose one of those,
- look for the letter following "ei" (possibly "t"),
- then add the "t" to make "eit",
- and start again using "it" (... always the last two letters).

Here is a sample:

Either great into get very deep welled of it it, andto wondere started into the book about hear!

Now, **that** looks good! By sampling from a real source we can get good results.
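The steps listed above can be sketched quite directly: find every place the last two letters appear, pick one at random, and copy the letter that comes next. This is only an illustration with a short stand-in text, and `next_letter` and `generate` are made-up names.

```python
import random

def next_letter(text, context, rng=random):
    """Pick a random occurrence of `context` in `text`; return the letter after it."""
    positions = []
    start = text.find(context)
    while start != -1:
        if start + len(context) < len(text):  # skip a match at the very end
            positions.append(start)
        start = text.find(context, start + 1)
    if not positions:
        return None  # this combination never appears with a letter after it
    chosen = rng.choice(positions)
    return text[chosen + len(context)]

def generate(text, seed, length, rng=random):
    """Grow `seed` one letter at a time, always using the last len(seed) letters."""
    out = seed
    context_len = len(seed)
    while len(out) < length:
        letter = next_letter(text, out[-context_len:], rng)
        if letter is None:
            break
        out += letter
    return out

# Stand-in source text (an assumption), not the text used for the sample above.
source = "either way she went on reading either the book or the letter with great interest "
print(generate(source, "ei", 60))
```

Starting with a seed of 3 or 4 letters instead of 2 makes the same code use longer groups of letters to choose the next one.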

### 4 Letter Frequencies

Using the same method I used groups of 3 Letters to decide on the 4th letter and got:

happen next. First, she look down mind

### 5 Letter Frequencies

And with 5 Letter frequencies:

There was just in time it all seemed quite natural);but to take out of time as she had not like to do

## Try For Yourself!

Yes, I wrote something for you to play with. It has the first 6 paragraphs from *Alice's Adventures in Wonderland*, but **you can put your own source text in there**.

Find something from Shakespeare, or a political speech and see what it comes up with ... you could even combine quotes from different authors to see what their children might write.