Telegrams
A few days ago at work, a friend of mine came up with this problem. I call it the “telegram” problem: you get a string consisting of words run together without spaces, and figure out where spaces can go to separate out the words. For example, a famous misunderstanding:
parse_telegram("turkeytrotstowaterwhereistaskforcethirtyfourtheworldwonders") -- turkey trots to water where is task force thirty four the world wonders
I wrote a solution last night.
The dictionary
First things first, we’re going to need a dictionary of words. If you’re on a Mac or most Linux distros you already have one, at /usr/share/dict/words
. If not, I recommend Googling for the OSPD, the Official Scrabble Player’s Dictionary.
The dictionary comes to us as a text file with one word per line. Mine, at least, contains a bunch of proper nouns, words with apostrophes, and single-letter words, which I filtered out (adding back “i” and “a,” the only two single-letter words that really exist).
We need to turn this into some kind of data structure we can easily query. Specifically, we want to quickly find out if there are any words that begin with a certain substring, for example, there are words that begin with “appl” but none that begin with “appq.” The data structure we’ll use for this is called a trie. It’s a tree with each node representing a letter, and the edges leading out of the node representing letters that could follow this one. So, for example, here’s a trie containing the words “an,” “ant,” “app,” “apple,” and “pear:”
A “$” node represents the end of a word, so, if a node has a “$” as a child, that means that letter can complete a word (but doesn’t have to, if there are other children).
So, here’s a function that reads a dictionary file, filters out the words we don’t want, and inserts them into a trie:
function load_dictionary(filename) local dictionary = {} for line in io.lines(filename) do if line:match("^%l%l+$") then insert(dictionary, line .. "$") end end -- We're cutting all single-letter words, but -- two of them are real words insert(dictionary, "a$") insert(dictionary, "i$") return dictionary end
And here’s the corresponding insert
function, to actually put them in the trie:
function insert(dictionary, word) local first = word:sub(1,1) dictionary[first] = dictionary[first] or {} if first ~= "$" then insert(dictionary[first], word:sub(2)) end end
Finding all possible words
The first step to my solution is to find all the possible words in the telegram, matched to their place in the string. So, for the telegram “hellothere” we will make a list of words like this:
hellothere he hell hello -ell ---lo ---lot ---loth ----other -----the -----there ------he ------her ------here -------ere --------re
(Your list may be slightly different using the OSPD instead of Ubuntu’s dictionary like I am.)
find_words
takes a telegram and a trie and returns a list of {position, word}
tuples for each word. The way it works is pretty straightforward: for each letter in the telegram, search the trie for words that start with that letter. So, we look for a “$” node under the first “h,” if we don’t find one we move down to “e” and then look for a “$” node, if we find one we insert “he” and then move down to “l,” and so on, until we reach a point where the current trie node doesn’t have a child for the next letter in the telegram. Then we stop and try again with the next starting letter.
function find_words(telegram, dictionary) local words = {} -- list of tuples, {start, word} for start = 1, #telegram do local current_index = start local current_letter = telegram:sub(start, start) local current_node = dictionary while current_node[ current_letter ] do current_node = current_node[ current_letter ] if current_node['$'] then table.insert(words, {start, telegram:sub(start, current_index)}) end current_index = current_index + 1 current_letter = telegram:sub(current_index, current_index) end end return words end
Culling the word list
Looking at our list of words, it should be clear that some of these can’t be part of any solution. For example, look at the word “lot” starting at position 4: any solution that contains “lot” must also contain a word that ends at position 3, as well as one that begins with position 7 (the left and right neighbors of “lot”). “Here” begins at 7, so we have a right neighbor, but nothing ends at 3, so, “lot” cannot be part of a solution.
We need to go through the list of tuples and remove any words that don’t have left and right neighbors. Of course, those words we remove might themselves be the neighbors of other words, meaning that removing them might leave those words without a neighbor. So we need to keep culling the list until there’s nothing left that can be removed.
We’ll make this fast by keeping a count of all the words that start and end at each position, and checking each word to ensure that that count is positive for each of them.
function cull_words(words, telegram_length) -- Map from start / end position, to word count local by_start = setmetatable( {}, {__index=function() return 0 end} ) local by_end = setmetatable( {}, {__index=function() return 0 end} ) -- Initialize these counts for _, w in ipairs(words) do local l = w[1] local r = w[1] + #(w[2]) - 1 by_start[l] = by_start[l] + 1 by_end[r] = by_end[r] + 1 end -- Loop until we run out of things to remove while true do local new_words = {} for i, w in ipairs(words) do local left = w[1] local right = w[1] + #(w[2]) - 1 local left_neighbor = (left == 1) or (by_end[left-1] > 0) local right_neighbor = (right == telegram_length) or (by_start[right+1] > 0) if left_neighbor and right_neighbor then table.insert(new_words, w) else by_start[left] = by_start[left] - 1 by_end[right] = by_end[right] - 1 end end if #new_words == #words then break end words = new_words end return words end
Enumerating the solutions
At this point, everything in our word list can be part of a solution. We can organize our words into a tree, like so:
iamatelegram i -a -am --ma --mate ---a ---ate ----telegram ------leg ---------ram
Here’s a simple function to do that:
function build_word_tree(words) local function find_by_start(start) local arr = {} for _,w in ipairs(words) do if w[1] == start then table.insert(arr, w[2]) end end return arr end local function tree_for_start(start) local tree = {} for _, w in ipairs(find_by_start(start)) do tree[w] = tree_for_start( start + #w ) end return tree end return tree_for_start(1) end
Now, every path through this tree is a possible solution. So, we simply do a depth-first traversal of the tree and list out all the paths:
function print_tree(word_tree) local strings = {} -- The paths through the tree local function dft(path, current_node) for word, children in pairs(current_node) do local new_path = path and path .. " " .. word or word if not next(children) then table.insert(strings, new_path) else dft(new_path, children) end end end dft(nil, word_tree) for _, s in ipairs(strings) do print(s) end end
Putting it together
Now that we have all the tools, we just run them all together:
function parse_telegram(telegram) local words = find_words(telegram, dictionary) local culled = cull_words(words, #telegram) local tree = build_word_tree(culled) print_tree(tree) end
Let’s test it out:
parse_telegram("turkeytrotstowaterwhereistaskforcethirtyfourtheworldwonders") -- turkey trots to water where is task force thirty four the world wonders
Again, you may get more possible solutions if you use a different dictionary.
Code, next steps
As always, the complete code is on Github. If you want to play with it, try messing around with the print_tree
function: look for the path through the tree with the highest average word length (hint: the total word length will always be constant).