Codehead's Corner
Random ramblings on hacking, coding, fighting with infrastructure and general tech
Wordle Analysis
Posted: 6 Feb 2022 at 13:26 by Codehead

Intro

Following on from breaking Wordle in my earlier post, I decided to use the data extracted from the app to try and work out the best starting words for the game.

I’ve seen a few articles about people’s chosen starters, which are often vowel-heavy words. However, I have all the solutions and the valid words, so I can run some analysis and select a statistically grounded answer rather than guessing a word.

There is a lot of analysis, charts and statistical calculation below. The idea is to walk through the process so you understand why the words were selected. However, if you just want the results, scroll to the bottom of the page.

Facts and Figures

The solution list contains 2315 words. This means that Wordle has enough daily answers to run until Oct 21, 2027. The recent news that the game has been bought by the New York Times had people rushing to save a local copy of the game to play for free ‘forever’. Looks like we only have 5 years’ worth of games unless the answer list is extended.

The valid word list is much bigger at 10657 words. Some of the entries are pretty bizarre, so it wouldn’t be a good idea to use those as an extended solution list. However, if this were the answer list, the game could run until Aug 23, 2050. That is only really an option if you’re happy with answers like: “aiyee”, “akkas”, “buhls”, “dzhos” and “thagi”.
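The two end dates above can be reproduced with simple date arithmetic. This is a minimal sketch, assuming the first puzzle was published on 19 June 2021 and that one word is used per day; the function name is just for illustration.

```python
from datetime import date, timedelta

# Assumption: puzzle number 0 was published on 19 June 2021.
FIRST_PUZZLE = date(2021, 6, 19)

def list_runs_until(list_length: int) -> date:
    """Start date plus one day per word in the list."""
    return FIRST_PUZZLE + timedelta(days=list_length)

print(list_runs_until(2315))   # 2027-10-21 (solution list)
print(list_runs_until(10657))  # 2050-08-23 (full valid word list)
```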

Simple Hit Test

The first test is a quick pass through to average out the ‘hit count’ for valid words against the solutions. If we take each word in the valid word list and count how many letters it reveals (either an orange or green square) in each solution word, we should get some idea of the usefulness of each valid word.
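As a rough illustration of this first pass, here is a minimal sketch of one way to compute the average. It assumes the simplest possible rule: every position in the guess whose letter appears anywhere in the solution counts as one hit. The function names and the toy solution list are made up for the example; the real run would use the lists extracted from the game.

```python
def hit_count(guess: str, solution: str) -> int:
    """Count guess positions whose letter occurs anywhere in the solution."""
    return sum(1 for letter in guess if letter in solution)

def average_hits(guess: str, solutions: list[str]) -> float:
    """Average hit count of one guess across every solution."""
    return sum(hit_count(guess, s) for s in solutions) / len(solutions)

# Toy data only: stand-in for the real solution list.
solutions = ["crane", "robot", "wince"]
for guess in ("areae", "buzzy"):
    print(guess, round(average_hits(guess, solutions), 5))
```

Note that this naive rule counts a repeated letter in the guess every time it appears, so words with doubled letters get a boost.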

The top 10 best words based on the average number of letters revealed are:

Rank | Word | Score
1 | AREAE | 2.05917
2 | ARERE / RAREE | 2.02807
3 | RESEE | 1.99697
4 | AREAR | 1.96457
5 | ARETE / REATE | 1.95464
6 | LAREE / LEARA / LEEAR | 1.94643
7 | AERIE | 1.94600
8 | EASER / SAREE / SEARE | 1.93347
9 | LEESE | 1.91533
10 | ARENE / RANEE | 1.90410

Several entries are anagrams of each other, and every permutation of the same letters gets the same score. This happens quite a lot in the valid word list.
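Because the score only depends on which letters a word contains, not on their order, anagrams can be collapsed into a single table row. A minimal sketch of one way to do that grouping, using the sorted letters as a key (the helper name is hypothetical):

```python
from collections import defaultdict

def group_anagrams(words: list[str]) -> list[list[str]]:
    """Group words that are made of exactly the same letters."""
    groups = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))  # anagrams share the same sorted letters
        groups[key].append(word)
    return [sorted(group) for group in groups.values()]

print(group_anagrams(["arete", "reate", "aerie"]))
# [['arete', 'reate'], ['aerie']]
```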

At the other end of the scale, the 10 worst words are:

Rank | Word | Score
2315 | BUZZY | 0.52311
2314 | MUZZY | 0.53650
2313 | WHIZZ | 0.55723
2312 | HUZZY | 0.57149
2311 | PZAZZ | 0.58747
2310 | BIZZY | 0.60518
2309 | MIZZY | 0.61857
2308 | PHIZZ | 0.62289
2307 | JUMBY | 0.63326
2306 | ZUZIM | 0.63585

If you have BUZZY as your start word, you may wish to reconsider your life choices.

Duplicate Letter Removal

This is a good start, but many of the top-scoring words contain repeated letters. That makes for an inefficient guess; we really want 5 unique letters.

Repeated letters also distort the calculations: a word that contains the same letter twice scores two even though it has only found a single letter. To fix this, we can tweak the code so that repeated letters only score once.
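One way to make that tweak is to compare sets of letters instead of positions, so each distinct letter can only score once. This is a sketch of the idea under that assumption, not necessarily the exact change made to the original code; the function name is made up.

```python
def unique_hit_count(guess: str, solution: str) -> int:
    """Count distinct letters shared by the guess and the solution."""
    return len(set(guess) & set(solution))

# 'areae' only contains three distinct letters, so it can score at most 3.
print(unique_hit_count("areae", "crane"))  # 3
# 'susus' is effectively a two-letter guess.
print(unique_hit_count("susus", "sushi"))  # 2
```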

Running the test again gives us these new results:

Rank | Word | Score
1 | OATER / ORATE / ROATE | 1.78920
2 | REALO | 1.78099
3 | ARTEL | 1.77840
4 | ARTEL / RATEL / TALER | 1.77840
5 | RETIA / TERAI | 1.77796
6 | ARIEL / RAILE | 1.76976
7 | AEROS / SOARE | 1.76803
8 | ARETS / ASTER / EARST / RATES / REAST / RESAT / STEAR / STRAE / TARES / TASER / TEARS / TERAS | 1.76544
9 | ARETS / ASTER | 1.76544
10 | ARLES / EARLS / LAERS / LARES / LASER / LEARS / RALES / REALS / SERAL | 1.75723

There are a lot of anagrams here, even in the top entry. We will need a method to find the best one later.

At the other end of the table we have some really odd words:

Rank | Word | Score
2315 | QAJAQ | 0.41684
2314 | IMMIX | 0.42419
2313 | ZOPPO | 0.45529
2312 | GYPPY | 0.45917
2311 | KUDZU | 0.45961
2310 | SUSUS | 0.46436
2309 | YUKKY | 0.46479
2308 | FUFFY | 0.46695
2307 | JUGUM | 0.46738
2306 | JUJUS | 0.47602

Duplicate-letter words feature heavily at the bottom of the table. ‘SUSUS’ is effectively a two-letter guess and predictably scores quite badly.

Frequency Analysis

Now we know which words are most successful at finding letters, but which letters should we look for? A valuable technique that is often used in cryptographic CTF challenges is frequency analysis. In simple terms, this means counting the occurrences of each letter in a body of text to find out which ones appear most.
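A frequency count like this only takes a few lines. Here is a minimal sketch using Python's `collections.Counter`; the word list is a toy stand-in and the function name is hypothetical.

```python
from collections import Counter

def letter_frequencies(words: list[str]) -> Counter:
    """Count how often each letter appears across all words."""
    counts = Counter()
    for word in words:
        counts.update(word.lower())
    return counts

# Toy data only: stand-in for the real word list.
words = ["crane", "robot", "wince"]
freq = letter_frequencies(words)
total = sum(freq.values())

# Print the five most common letters with their share of all letters.
for letter, count in freq.most_common(5):
    print(letter, count, f"{100 * count / total:.1f}%")
```

Dividing each count by the total gives percentages, which makes the numbers comparable with published frequency tables for other bodies of text.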

Here are the results of a quick scan. We have some clear winners:

[Chart: letter counts for the Wordle word list]

If we sort the results, we see that the top five letters are E, A, R, O and T.

[Chart: Wordle letter counts, sorted from most to least common]

This ties in with the previous analysis results. Our top words ‘OATER’, ‘ORATE’ and ‘ROATE’ contain those top 5 letters. We should definitely be prioritising ‘E’, ‘A’ and ‘O’ when vowel hunting. ‘I’ and ‘U’ are much further down the list.

These results are quite interesting. We can see that the Wordle word list does not conform to the usual frequency distribution that we see in regular English text or even in other word lists like a dictionary. Here are our Wordle numbers as percentages matched up to typical English text and dictionary figures:

[Chart: Wordle letter frequencies as percentages, alongside typical English text and dictionary figures]

At first glance, the numbers seem to be fairly close. However, if we sort by English text frequency, we get a different top 5 to our previous results:

[Chart: Wordle, English text and dictionary letter frequencies, sorted by English text frequency]

In the Wordle results, ‘R’ is the 3rd most common letter, but in English text it is much lower at 9th place. Also, ‘I’ has jumped up to the top 5 in this plot.

Sorting by Dictionary frequency provides an even more contrasting top 5:

[Chart: Wordle, English text and dictionary letter frequencies, sorted by dictionary frequency]

Previous frontrunners ‘T’ and ‘O’ have dropped right down and ‘I’ is even higher here. The vowel priority is completely scrambled.

OK, Lots of Charts. So What?


This data shows that we cannot rely on our usual assumptions about popular letters and words. Guesses based on our day-to-day experience of letter occurrence and frequency are likely to be less successful than normal. We have shown that Wordle’s distribution does not conform to expected norms and we should choose our starting words more carefully.

Refining The Results

We know from the de-duplicated results and frequency analysis that ‘E’, ‘A’, ‘R’, ‘O’ and ‘T’ are the most likely letters to appear in the solution.

If we can’t work out the answer after the first guess using those letters, we can work down the list trying the next most common letters in order: ‘L’, ‘I’, ‘S’, ‘N’, ‘C’ then ‘U’, ‘Y’, ‘D’, ‘H’, ‘P’.

If we use three guesses, we will have covered the 15 most popular letters and should be pretty close to the solution.

We have a choice of words for the top 5 letters and the first guess, but the other two are a little more problematic. The only words I could come up with that fitted all ten letters with no repetition were: ‘LINCH’ and ‘PUDSY’.
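A search like this could also be automated. The sketch below is one possible approach, not the method used to find the words above: it keeps only the words built from five distinct letters of the target set, then looks for pairs that cover all ten letters between them. The function name and the toy word list are assumptions for illustration.

```python
from itertools import combinations

def covering_pairs(words: list[str], letters: str) -> list[tuple[str, str]]:
    """Find pairs of five-letter words that together use every target letter once."""
    target = frozenset(letters)
    # Keep words with five distinct letters, all drawn from the target set.
    candidates = [
        w for w in words
        if len(w) == 5 and len(set(w)) == 5 and set(w) <= target
    ]
    # Two such words cover a ten-letter target only if they share no letters.
    return [
        (a, b) for a, b in combinations(candidates, 2)
        if set(a) | set(b) == target
    ]

# Toy data only: stand-in for the real valid word list.
valid_words = ["linch", "pudsy", "lindy", "roate"]
print(covering_pairs(valid_words, "lisncuydhp"))
# [('linch', 'pudsy')]
```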

Finally, let’s see if there’s a difference between the three versions of the top words: ‘OATER’, ‘ORATE’ and ‘ROATE’.

This final test works out the average score for each variant based on exact letter matches, i.e. green squares.
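Counting exact matches means comparing the guess and the solution position by position. A minimal sketch of that scoring, with hypothetical function names and a toy solution list:

```python
def green_count(guess: str, solution: str) -> int:
    """Count positions where the guess has the right letter in the right place."""
    return sum(1 for g, s in zip(guess, solution) if g == s)

def average_greens(guess: str, solutions: list[str]) -> float:
    """Average number of green squares for one guess across every solution."""
    return sum(green_count(guess, s) for s in solutions) / len(solutions)

# Toy data only: stand-in for the real solution list.
solutions = ["route", "crate", "robot"]
for guess in ("roate", "orate", "oater"):
    print(guess, round(average_greens(guess, solutions), 5))
```

Unlike the earlier tests, this score does depend on letter order, which is why it can separate anagrams.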

The results were:

Rank | Word | Score
1 | ROATE | 0.54168
2 | ORATE | 0.50885
3 | OATER | 0.42591

Conclusion

So after much analysis and number crunching we have our three starting words:

ROATE
LINCH
PUDSY

These three words are valid in Wordle, they cover the most frequently occurring letters and test all vowels with no repetition.
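The coverage claims are easy to check mechanically. A small sketch (the helper name is made up) that counts distinct letters, repeats and missing vowels for any set of starting words:

```python
def coverage(words: list[str]) -> dict:
    """Summarise how many distinct letters a set of guesses covers."""
    letters = "".join(words).lower()
    distinct = set(letters)
    return {
        "distinct_letters": len(distinct),
        "repeats": len(letters) - len(distinct),
        "missing_vowels": sorted(set("aeiou") - distinct),
    }

print(coverage(["roate", "linch", "pudsy"]))
# {'distinct_letters': 15, 'repeats': 0, 'missing_vowels': []}
```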

Hopefully you won’t need all three, but if you do, you’ll be in good shape to make solid guesses with your last three lines.


