Me Translate Pretty One Day
Spanish to English? French to Russian? Computers haven't been up to the task. But a New York firm with an ingenious algorithm and a really big dictionary is finally cracking the code.
By Evan Ratliff
WHEN MEANINGFUL MACHINES first tested its Spanish-English engine on the BLEU scale in spring 2004, "it came in at 0.37," recalls the company's CEO, Steve Klein. "I was pretty dejected. But Jaime said, 'No, that's pretty good for flipping the switch the first time.'" A few months later, the system had jumped above 0.60 in internal tests, and by the time of Carbonell's presentation in August, the score in blind tests was 0.65 and still climbing. Although the company didn't test the passage with any statistics-based systems, when it tested Systran and another publicly available rule-based system, SDL, on the same data, both scored around 0.56, according to Carbonell's paper. Meaningful Machines was in stealth mode at the time, protecting its ideas. But Carbonell was itching to talk about his results. He didn't just have an engine that he says earned the highest BLEU score ever recorded by a machine. He had an engine that had done it without relying on parallel text.
Instead, the Meaningful Machines system uses a large collection of text in the target language (in the initial case it's 150 Gbytes of English text derived from the Web), a small amount of text in the source language, and a massive bilingual dictionary. Given a passage to translate from Spanish, the system looks at each sentence in consecutive five- to eight-word chunks. The al Qaeda message analysis, for example, might start with "Declaramos nuestra responsabilidad de lo que ha ocurrido." Using the dictionary, the software employs a process called flooding to generate and store all possible English translations for the words in that chunk.
Making this work effectively requires a dictionary that includes all of the possible conjugations and variations for every word. Declaramos, for example, offers up "declare," "declared," "declaring," "stating," and "testifying," among others. Meaningful Machines' Spanish-to-English dictionary, a database with about 2 million entries (20 times more than a standard Merriam-Webster's), is a lexical feat in and of itself. The company outsourced the task to an institute run by Jack Halpern, a prominent lexicographer. The result is one of the largest bilingual dictionaries in the world.
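To make the flooding step concrete, here is a minimal sketch in Python. It is not the company's code: the tiny dictionary, its entries, and the function name `flood` are all hypothetical, and the sketch assumes each source word maps independently to a list of English options.

```python
from itertools import product

# Toy bilingual dictionary. The entries are hypothetical and far fewer
# than the roughly 2 million in the dictionary described above.
TOY_DICTIONARY = {
    "declaramos": ["we declare", "we declared", "we state"],
    "nuestra": ["our"],
    "responsabilidad": ["responsibility", "liability"],
}


def flood(chunk, dictionary):
    """Return every English candidate made by picking one option per source word.

    Words missing from the dictionary are passed through unchanged.
    """
    options = [dictionary.get(word.lower(), [word]) for word in chunk.split()]
    return [" ".join(choice) for choice in product(*options)]


candidates = flood("Declaramos nuestra responsabilidad", TOY_DICTIONARY)
for candidate in candidates:
    print(candidate)
```

Even this three-word chunk yields six candidates, because the option counts multiply. That is why a full five- to eight-word chunk run against a dictionary of every conjugation can produce thousands.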
The options spit out by the dictionary for each chunk of text can number in the thousands, many of which are gibberish. To determine the most coherent candidates, the system scans the 150 Gbytes of English text, ranking candidates by how many times they appear. The more often they've actually been used by an English speaker, the more likely they are to be a correct translation. "We declare our responsibility for what has occurred" is more likely to appear than, say, "responsibility of which it has happened."
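The ranking idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two-sentence corpus stands in for 150 Gbytes of text, the names `count_occurrences` and `rank_candidates` are made up, and a real system would need an index rather than a linear scan.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_occurrences(phrase, corpus_tokens):
    """Count how many times the phrase appears as a contiguous run of tokens."""
    words = tokenize(phrase)
    n = len(words)
    return sum(
        1
        for i in range(len(corpus_tokens) - n + 1)
        if corpus_tokens[i:i + n] == words
    )


def rank_candidates(candidates, corpus):
    """Return (count, candidate) pairs, most frequent first."""
    tokens = tokenize(corpus)
    scored = [(count_occurrences(c, tokens), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical stand-in for the English corpus.
TOY_CORPUS = (
    "We declare our responsibility for what has occurred. "
    "Today we declare our responsibility for what has occurred in full."
)

ranked = rank_candidates(
    [
        "responsibility of which it has happened",
        "we declare our responsibility for what has occurred",
    ],
    TOY_CORPUS,
)
print(ranked)
```

The gibberish candidate never appears in the corpus, so it sinks to the bottom with a count of zero.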
Next, the software slides its window one word to the right, repeating the flooding process with another five- to eight-word chunk: "nuestra responsabilidad de lo que ha ocurrido en." Using what Meaningful Machines calls the decoder, it then rescores the candidate translations according to the amount of overlap between each chunk's translation options and the ones before and after it. If "We declare our responsibility for what has happened" overlaps with "declare our responsibility for what has happened in" which overlaps with "our responsibility for what has happened in Madrid," the translation is judged accurate.
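A rough Python sketch of the sliding window and the overlap rescoring follows. The article does not say how overlap is actually measured, so this assumes the simplest reading: count the words where the end of one candidate matches the start of the next. The function names, the candidate lists, and the Spanish sentence ending in "en Madrid" (inferred from the English example) are all assumptions.

```python
def sliding_chunks(sentence, size):
    """Split a sentence into overlapping chunks, moving one word at a time."""
    words = sentence.split()
    last_start = max(1, len(words) - size + 1)
    return [" ".join(words[i:i + size]) for i in range(last_start)]


def overlap(left, right):
    """Length in words of the longest suffix of `left` that is a prefix of `right`."""
    a, b = left.split(), right.split()
    for size in range(min(len(a), len(b)), 0, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def rescore(windows):
    """Score each candidate by its best overlap with the neighboring windows.

    `windows` holds one list of candidate translations per window position.
    """
    scores = []
    for i, candidates in enumerate(windows):
        window_scores = {}
        for cand in candidates:
            score = 0
            if i > 0:
                score += max((overlap(p, cand) for p in windows[i - 1]), default=0)
            if i + 1 < len(windows):
                score += max((overlap(cand, n) for n in windows[i + 1]), default=0)
            window_scores[cand] = score
        scores.append(window_scores)
    return scores


chunks = sliding_chunks(
    "Declaramos nuestra responsabilidad de lo que ha ocurrido en Madrid", 8
)

# Hypothetical candidates for the three window positions.
windows = [
    ["we declare our responsibility for what has happened",
     "responsibility of which it has happened"],
    ["declare our responsibility for what has happened in",
     "our liability of it has passed in"],
    ["our responsibility for what has happened in Madrid"],
]
scores = rescore(windows)
best_middle = max(scores[1], key=scores[1].get)
print(chunks[1])
print(best_middle, scores[1][best_middle])
```

In the middle window, the coherent candidate shares seven words with its best neighbor on each side, while the garbled one shares none. That agreement is the signal the decoder is described as rewarding.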
So what happens if the dictionary is missing words or if the overlap technique can't find a match? A third process, called the synonym generator, is used to search for unknown terms in the smaller Spanish-only set. When it finds them, it drops the original term and searches for other sentences using the surrounding words. The process is easiest to understand with an example in English. When run through the synonym generator, the phrase "it is safe to say" might turn up results like "it is safe to say that within a week" or "it is safe to say that even a blind squirrel ..." By removing "it is safe to say" from each sentence and then searching for other terms that fit the surrounding words, the generator suggests results like "it is important to note" or "you will find" – instead of, for example, "it is unhurt to speak."
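The English example above can be mimicked with a short Python sketch. It is a guess at the general shape of the idea, not the company's method: it records the words around each occurrence of a phrase, then looks for other word runs that sit in the same surroundings. The four-sentence corpus, the context width of two words, and the names `find_contexts` and `suggest_alternatives` are all assumptions.

```python
from collections import Counter


def find_contexts(phrase, sentences, width=2):
    """Collect the (left, right) word contexts around each occurrence of the phrase."""
    target = phrase.split()
    n = len(target)
    contexts = set()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                left = tuple(words[max(0, i - width):i])
                right = tuple(words[i + n:i + n + width])
                contexts.add((left, right))
    return contexts


def suggest_alternatives(phrase, sentences, width=2, max_len=5):
    """Count other word runs that appear in the same contexts as the phrase."""
    contexts = find_contexts(phrase, sentences, width)
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for left, right in contexts:
            for start in range(len(words)):
                if tuple(words[max(0, start - width):start]) != left:
                    continue
                for length in range(1, max_len + 1):
                    end = start + length
                    if end > len(words):
                        break
                    candidate = " ".join(words[start:end])
                    if candidate != phrase and tuple(words[end:end + width]) == right:
                        counts[candidate] += 1
    return counts


# Hypothetical monolingual corpus, already lowercased and stripped of punctuation.
SENTENCES = [
    "it is safe to say that within a week the results will arrive",
    "it is safe to say that even a blind squirrel finds a nut",
    "it is important to note that within a week prices fell",
    "you will find that even a small change helps",
]

suggestions = suggest_alternatives("it is safe to say", SENTENCES)
print(suggestions)
```

On this toy corpus the sketch proposes "it is important to note" and "you will find", the same kind of context-driven substitutes the article describes, rather than a word-for-word rendering.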
The system, Carbonell tells me, is "simple … anybody can understand it." It's so simple, in fact, that Carbonell is peeved that he didn't think of it first.
BORN IN URUGUAY, Jaime Carbonell moved to Boston with his family when he was nine. He later enrolled at MIT, where he found part-time work translating Digital Equipment Corporation computer manuals into Spanish to help pay tuition. In an attempt to speed up the translation process, he built a small MT engine that ran the documents through a glossary of common DEC terms, substituting the translations automatically. The little system worked so well that Carbonell continued to dabble in it while earning his computer science doctorate at Yale University. After coauthoring a paper outlining a new type of rule-based MT, he was offered a professorship at Carnegie Mellon. There he helped develop a successful commercial rule-based translation system. Then he hopped on the wave of text-based MT in the '90s.
One afternoon in 2001, Carbonell got a cold call from Steve Klein, a lawyer, hotel investor, and occasional film writer and director. Klein said that he'd formed a partnership with an Israeli inventor named Eli Abir – a man with little schooling or technical training who previously ran a restaurant. Abir, according to Klein, had a new machine-translation idea they wanted Carbonell to evaluate. Klein had been one of the first people to take the garrulous Abir seriously when he began hitting up investors for a previous invention in 2000, often in jeans and a T-shirt, claiming credentials as "the worst student in the history of the Israeli school system." Abir, who is bilingual in Hebrew and English, also said he could solve several of the world's thorniest computer science problems, based in part on knowledge gained from three days of playing SimCity.
Suspicious but curious, Carbonell agreed to meet the pair. When they arrived in his office and Abir explained the concept for what is now called the decoder, Carbonell was floored by its elegance. "In the few weeks that followed, I kept wondering, 'Why didn't I think of that? Why didn't the rest of the field think of that?' Finally I said, Enough of this envy. If I can't beat them, join them."
http://www.wired.com/