Random Words

Probability and English ... what a mix!

Random Letters

You may think it easy to create random words ... just pick letters randomly and put them together, and voila! a random word.
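As a minimal sketch of that idea in Python (the function name random_word is made up for this example, and we assume lowercase a to z with every letter equally likely):

```python
import random
import string

def random_word(length=5, rng=random):
    """Return `length` lowercase letters, each picked uniformly at random."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

# Make 20 five-letter "words"; the output changes on every run.
words = [random_word() for _ in range(20)]
print(" ".join(words))
```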

Well, here are 20 words made that way:

tldkl oewkx dmwol vuptg hvwjk naqid avypr zwtip zgnzs bvdhd
muyfd ighgd xhlng oyecn vjnsl ssjrx gxald tukxj rvfoq yxzxq

It turns out that the words are not only nonsense, but quite hard to pronounce!

(Try saying "tldkl" or "oewkx")

You see, getting a real word this way is very unlikely ... you would have to try lots of random combinations before getting lucky.

Why? Well, English has around 200,000 words (228,000 in the Oxford English Dictionary including many words no longer used) ... but how many different words can be made with just 5 letters?

26 × 26 × 26 × 26 × 26 = 11,881,376 possible 5 letter words!

And that is just the 5 letter words ...

Let us guess that there are 40,000 words in English that have 5 letters. So the probability of making a real word just randomly would be:

40,000 / 11,881,376 = 0.003, or about 0.3% chance

So real words are rare. And we can see that putting random letters together is very unlikely to produce a real word.
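The arithmetic above can be checked with a few lines of Python. Remember that the 40,000 figure is only the guess made above, not a measured count, and the variable names are just for illustration:

```python
# Number of possible 5-letter strings over a 26-letter alphabet.
total_words = 26 ** 5

# Guessed number of real 5-letter English words (an assumption, not a count).
real_words_guess = 40_000

# Chance that one random 5-letter string is a real word.
p_real = real_words_guess / total_words

print(total_words)            # 11881376
print(f"{p_real:.1%}")        # 0.3%
print(round(1 / p_real))      # 297, so about 1 real word in every 297 tries
```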

Vowels

We can improve our success by insisting that a word have at least one vowel, since nearly every word in English has one (except fly, by and a few others). Like this:

ectot gjaqv kuifg vzicu zspsu pdidb wqdis uerrs ucgej okimw
fnevz ewxko ljgew aglgo jpfoq dcytu uwkcj dzioy wekdx xuybk

This is a great improvement. More words can be pronounced.

But there are still lots of strange words like "zspsu" and "xuybk"
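One simple way to insist on a vowel is to keep making random words and throw away any that have none. This is only a sketch: the name random_word_with_vowel is made up, and we count just a, e, i, o and u as vowels (so "y" does not count).

```python
import random
import string

VOWELS = set("aeiou")  # assumption: "y" is not treated as a vowel here

def random_word_with_vowel(length=5, rng=random):
    """Keep drawing uniform random words until one contains a vowel."""
    while True:
        word = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        if VOWELS & set(word):  # true when the word shares a letter with VOWELS
            return word

print(" ".join(random_word_with_vowel() for _ in range(20)))
```

With 21 non-vowels out of 26 letters, roughly a third of uniform 5-letter words have no vowel at all, so the loop rarely needs more than a couple of tries.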

Letter Frequency

So, our next improvement is to use fewer of the rare letters like j, x, z and q, and more of the common letters like e, t and s.

In fact the frequency of letters in the English Language is well known. Here is how many times you would expect to see a letter in every 1,000 letters:

a	b	c	d	e	f	g	h	i	j	k	l	m	n	o	p	q	r	s	t	u	v	w	x	y	z
82	15	28	42	127	22	20	61	70	2	8	40	24	67	75	19	1	60	63	90	27	10	24	2	20	1

Can you see that "e" is common, but "z" is rare?

"e" is likely to occur 127 times in every 1,000, or as a ratio 127/1000 = .127 (=12.7%)
"z" is likely to occur only 1 time in every 1,000, or as a ratio 1/1000 = .001 (=0.1%)

So, by selecting letters based on that frequency (a bit like rolling a 1,000-sided die that has 82 a's, 15 b's ... and only one z), we can get output like this:

elnao etgov segty laast aessn siuon oenha eaoas ncoot ctwka
dmswo dpuoh eewis ebdni laarm syucs idvos lhina igahh soyie

Still no real words, but some are close. And most of them can be pronounced. (Great names if you are writing a science fiction novel!)
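Here is one possible way to do the "1,000-sided die" in Python, using the standard random.choices function with the counts from the table above as weights. The names FREQ_PER_1000 and weighted_word are made up for this sketch.

```python
import random
import string

# Expected count of each letter per 1,000 letters, a to z (from the table above).
COUNTS = [82, 15, 28, 42, 127, 22, 20, 61, 70, 2, 8, 40, 24,
          67, 75, 19, 1, 60, 63, 90, 27, 10, 24, 2, 20, 1]
FREQ_PER_1000 = dict(zip(string.ascii_lowercase, COUNTS))

LETTERS = list(FREQ_PER_1000)
WEIGHTS = list(FREQ_PER_1000.values())

def weighted_word(length=5, rng=random):
    """Pick each letter with probability proportional to its frequency."""
    return "".join(rng.choices(LETTERS, weights=WEIGHTS, k=length))

print(" ".join(weighted_word() for _ in range(20)))
```

Each letter is still chosen independently of its neighbours, which is why the results can be pronounced more often but are rarely real words.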

Try For Yourself!

You can try all three methods here ... see if you can get lucky and find a real word:

But we can do better ...

2-Letter Frequencies

We can take the idea of Letter Frequency one step further by asking

"What is the frequency of letters that follow another letter?"

For example, if we already have a "t", the next letter is very likely to be an "h" (making "th").

To illustrate this, I built up a Table of Two-Letter Frequencies (from Alice's Adventures in Wonderland).

Here is the line for "t":

Freq	a	b	c	d	e	f	g	h	i	j	k	l	m	n	o	p	q	r	s	t	u	v	w	x	y	z
t	238		41		727	11		3197	459			275	18	12	990			149	153	333	125		65		54

So, "h" occurred 3197 times after a "t" ("th") ... but "b" never followed a "t".
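A table like that can be built by counting pairs of neighbouring letters. The sketch below is not the program used for the table above: the function name pair_counts and the tiny sample sentence are made up, and it assumes we only count pairs where both characters are letters a to z.

```python
import string
from collections import Counter, defaultdict

def pair_counts(text):
    """Map each letter to a Counter of the letters that immediately follow it."""
    table = defaultdict(Counter)
    text = text.lower()
    for first, second in zip(text, text[1:]):
        # Assumption: skip any pair that involves a space or punctuation.
        if first in string.ascii_lowercase and second in string.ascii_lowercase:
            table[first][second] += 1
    return table

sample = "the cat sat on the mat to talk with the other hat"
table = pair_counts(sample)

print(table["t"]["h"])  # 5  ("th" appears five times)
print(table["t"]["b"])  # 0  ("tb" never appears)
```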

OK, let us start with a "t", and let us say we choose an "h" to make "th", then next we would use the "h"-row to choose another letter (maybe an "e" to make "the"), and so on ... well, here is a sample:

the cur the bund hof arytowno d sheromasees asemedosouro f
soacthake d imon binofowat oaten d heng wa

The results are remarkable ... nonsense, but almost like some strange language.

In fact we are not just making random words now, we are making random sentences!
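To get spaces in the output, the space can be treated as just another character in the table. Here is a rough sketch under that assumption. The names build_table and generate are made up, and the source text is only the opening of Alice's Adventures in Wonderland in lowercase with the punctuation dropped, so the output will be much less varied than the sample above.

```python
import random
from collections import Counter, defaultdict

def build_table(text):
    """Count, for every character (spaces included), what comes right after it."""
    table = defaultdict(Counter)
    for first, second in zip(text, text[1:]):
        table[first][second] += 1
    return table

def generate(table, start, length, rng=random):
    """Grow a string by repeatedly choosing a follower of the last character."""
    out = start
    while len(out) < length:
        followers = table.get(out[-1])
        if not followers:  # nothing ever followed this character
            break
        chars = list(followers)
        counts = list(followers.values())
        out += rng.choices(chars, weights=counts, k=1)[0]
    return out

text = ("alice was beginning to get very tired of sitting by her sister "
        "on the bank and of having nothing to do")
print(generate(build_table(text), "t", 60))
```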

Higher Letter Frequencies

Why stop there? We can make tables of three letter frequencies or more ...

3 Letter Frequencies

How do 3 Letter Frequencies work?

Well, say I already have two letters (like "ei") ... we then:

look through the sample text for every time "ei" appears,
randomly choose one of those,
look for the letter following "ei" (possibly "t"),
then add the "t" to make "eit",
and start again using "it" (... always the last two letters).

Here is a sample:

Either great into get very deep welled of it it, and
to wondere started into the book about hear!

Now, that looks good! By sampling from a real source we can get good results.
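The steps listed above can be followed almost literally in code. This is a sketch, not the program that made the sample: the names next_letter and generate are made up, the source text is just one sentence, and the starting pair is "th" because "ei" does not appear in such a short text.

```python
import random

def next_letter(text, context, rng=random):
    """Find every place `context` appears and return the character after one of them."""
    followers = []
    pos = text.find(context)
    while pos != -1:
        end = pos + len(context)
        if end < len(text):          # ignore a match at the very end of the text
            followers.append(text[end])
        pos = text.find(context, pos + 1)
    return rng.choice(followers) if followers else None

def generate(text, start, length, rng=random):
    """Extend `start` one character at a time, always using the last two letters."""
    out = start
    while len(out) < length:
        ch = next_letter(text, out[-2:], rng)
        if ch is None:               # dead end: this pair has no follower
            break
        out += ch
    return out

text = ("Alice was beginning to get very tired of sitting by her sister "
        "on the bank, and of having nothing to do")
print(generate(text, "th", 60))
```

Choosing one occurrence at random does the same job as a frequency table: a letter that follows the pair more often has more chances to be picked.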

4 Letter Frequencies

Using the same method, I used groups of 3 letters to decide on the 4th letter and got:

Either the sides or conversations in time to
happen next. First, she look down mind

5 Letter Frequencies

And with 5 Letter frequencies:

There was just in time it all seemed quite natural);
but to take out of time as she had not like to do
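The same idea works for any group size. One way to sketch it is to build a lookup once, mapping each group of letters to a list of every character that followed it. The names build_model and generate, and the parameter name order (how many letters we look back, so "5 Letter Frequencies" means order = 4), are assumptions made for this example.

```python
import random
from collections import defaultdict

def build_model(text, order):
    """Map each run of `order` characters to a list of the characters that followed it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(text, order, length, rng=random):
    """Start with the first `order` characters of the text and keep extending."""
    model = build_model(text, order)
    out = text[:order]
    while len(out) < length:
        followers = model.get(out[-order:])
        if not followers:            # dead end: this group only ends the text
            break
        out += rng.choice(followers)  # repeats in the list act as weights
    return out

text = ("Alice was beginning to get very tired of sitting by her sister "
        "on the bank, and of having nothing to do")
for order in (2, 3, 4):
    print(order, generate(text, order, 60))
```

With a source this short, the higher orders mostly copy the original sentence, so a longer text is needed to see interesting results.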

Try For Yourself!

Yes, I wrote something for you to play with. It has the first 6 paragraphs from Alice's Adventures in Wonderland, but you can put your own source text in there.

Find something from Shakespeare, or a political speech and see what it comes up with ... you could even combine quotes from different authors to see what their children might write.

And Beyond

What if we were to take an entire encyclopedia, and choose not just sequences of letters, but sequences of word fragments? And what if they didn't have to be in order, but just near each other?

Would it Generate a good response using Pretrained data, by Transforming it? Is that a GPT?
