# Creating a Word Frequency Dictionary

While going through the Hulu interview process I was presented with an 8 hour programming challenge that consisted of creating a program capable of playing a game of hangman. While easy enough at face value the challenge presented several major obstacles. The one I will discuss here is my decision to use a word frequency dictionary, the realization that there existed no such thing (sort of) and the eventual creation of the dictionary.

To humans hangman is quite simple. It typically follows this methodology:

1. Begin by guessing a random letter that occurs with high frequency
2. Continue to guess letters until enough appear in a word that the word can be guess to a high degree of probability
3. Guess letters in the word aiming to fill blanks in not only that word but surrounding words as well
4. Once a word has been completed and no other words are apparent go back to step 1

I have dubbed this “The Battleship approach”. Randomly move through the possibilities looking for a “hit”. Once a hit is found it’s only a matter of time before the ship is completely uncovered from guessing the most likely locations. The problem with implementing such an approach on a computer is the jump between step 2 & 3. While humans are quite capable of finding a word that fits a series of blanks and letters a computer has no directly applicable skill to fill that role. Answer: regular expressions.

A regular expression (abbreviated regex) is a sequence of characters that form a search pattern that can then be compared directly against words or sentences to check equivalence.  With regular expressions it is possible to see that “_eavier” can be only 2 different words in the english language. They are heavier and leavier. Obviously, this is quite useful. Once this point has been reached it is a trivial matter to count the frequency of the possible letters and guess the one that occurs most often.

At this point all is fine and well, but there is a problem. Suppose as before the computer is trying to guess the final letter in the word “_eavier”. After looking at the possible letters to guess the computer would have to make a choice. Should it guess an ‘l’ or an ‘h’?  Looking at those possibilities you or I could could give an answer instantly. The computer, however, can make no such judgment. So how do you fix this? Word frequency. It’s the reason you or I can look at “_eavier” and pick ‘h’. We both know heavier is used more often than leavier.

To leverage word frequency in my program I was going to need a word list correlated with the frequency of those words as used in everyday language. I looked all over but was not able to find exactly what I needed. Complicating the matter is the problem of plurals and tenses. A standard dictionary (such as /usr/share/dict/words) contains only one form of each word and would be ineffective in a game where every variation of a word can be used. So I would have to make my own.

To start I needed a large sample of english language text. To prevent bias towards any specific type of words the corpus needed to span a wide number of topics and be written by many different authors. Enter Wikipedia.

To download Wikipedia used a torrent from this website. http://kopiwiki.dsd.sztaki.hu

Once complete I unzipped each of the files and used the “cat” command to merge them into one massive (18gb) text file.

`cat ./*txt > wikipedia.txt`

Since the text file contained a fair amount of garbage formatting data I attempted to use a vim script to strip and none article text from the document.

`vim -e -s wikipedia.txt < change.vim`

change.vim looked like this

```g/^_/d
g/^#/d
g/^=/d
g/^*/d
g/^[/d
g/^:/d
g/^!/d
g/^|/d
g/^&lt;/d
g/^{/d
g/^;/d
g/^&amp;/d
write
quit```

After some period of time this resulted in a memory error. It seemed my machine was’t quite capable of loading the whole file. My new plan to make the operation more manageable was to split the file back up into smaller parts.

`split -1 1000000 wikipedia.txt > new`

This produced 300 some files with 1,000,000 lines each.

To make the files easier to work with (read: to satisfy my OCD) I added .txt extensions.

`find . -type f -exec mv {} {}.txt;`

With that settled I wrote a small batch script to run change.vim on each of the 300 files.

```for file in *.txt; do
vim -e -s \$file < change.vim
echo \$file
done```

This seemed to work much better and in about 20 minutes all files were free of nasty non-article text.

I combined the files back into one

`cat ./*.txt > cleanWikipedia.txt`

and went about writing a word counter.

While counting the frequency of every word in the file would have been possible I felt that doing so could skew the results. In such a large document there are bound to be typos, numbers, and other inconsistencies that were not needed. To get around this I used this word list: http://www.morewords.com/enable2k.txt. In addition to containing every word in the english language it also contains every form.

I wrote the following python program to run through cleanWikipedia.txt and count the occurrences of each word in the morewords.com word list. At this point in the process cleanWikipedia.txt contained around 1,314,517,865 words so the execution of this script took ~15 minutes.

```res = {}
dictionaryWords = [line.strip() for line in open('theDictionary.txt')]
for spots in dictionaryWords:
res[spots] = 0
with open('cleanWikipedia.txt') as f:
for line in f:
thisline = line.split(" ");
for werd in thisline:
if res.has_key(werd.lower()):
res[werd.lower()] = res[werd.lower()]+1
f = open('weightedDictionary.txt','w')
for wird in dictionaryWords:
f.write('%s,%d\n'%(wird,res[wird]))
f.close()```

The output of the script can be seen here: https://raw.github.com/jackchammons/wordFrequency/master/weightedDictionary.txt. I chose not to convert the word count to frequency mostly because I like dealing with integers and the effect is the same.

At his point I had everything that I needed to give the challenge a solid attempt. After a few more hours I managed to achieve a consistent 63% success rate at solving the hangman games. To improve this I would have included formal nouns into the list of words I counted in cleanWikipedia.txt, they seemed to pop up ~25% of the time and generally wreaked havoc when they did.

Out of respect for Hulu I am not posting the source for my final solution because they asked me not to and I would like to work there someday.

Link to Github: https://github.com/jackchammons/wordFrequency