Testing Andrew Ng's AI Agent System for Language Translation

Exploring Meaning Retention Across Successive Translations

I’m testing Andrew Ng’s AI agent system for language translation and its stability under inverse transformation.

Given text in a source language, we apply the translation function f, then apply the inverse of f.

The result should be the identity: we get the original text back.


But what does math have to do with language translation?

In a perfect system, translating from English to Spanish to English should not alter the meaning of the source text.
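In symbols (my notation, not Ng’s): if f translates English to Spanish and f⁻¹ translates Spanish back to English, a perfect system satisfies f⁻¹(f(x)) = x for every source text x. The experiments below test how far a real system drifts from that ideal.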

Here’s the YouTube version of this post:

Andrew Ng’s Agent System for Translation

Andrew Ng released an open-source AI agent system for language translation.

The agent workflow has 3 steps:

  1. Agent translates from source language to target language

  2. Agent reviews translation, brainstorming ways to improve it (Reflection)

  3. Agent edits initial translation, incorporating feedback from step 2
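To make the pattern concrete, here’s a minimal sketch of those 3 steps, assuming a generic llm(prompt) -> str completion helper. The function and prompt wording are mine, not the repo’s; the real implementation is wired up inside ta.translate().

def agentic_translate(llm, source_lang, target_lang, source_text, country):
    # Step 1: initial translation
    draft = llm(
        f"Translate this {source_lang} text to {target_lang}:\n{source_text}"
    )
    # Step 2: reflection: critique the draft and brainstorm improvements
    feedback = llm(
        f"Review this {source_lang} to {target_lang} translation for a reader "
        f"in {country}. Suggest concrete improvements to accuracy, fluency, "
        f"style, and terminology.\nSource: {source_text}\nDraft: {draft}"
    )
    # Step 3: revision: edit the initial translation using the feedback
    return llm(
        f"Improve the draft translation by applying the feedback.\n"
        f"Source: {source_text}\nDraft: {draft}\nFeedback: {feedback}"
    )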

Here’s the open-source GitHub repo.

If you’re new to generative AI agents, I recommend reading this first.

Reflection is a useful agent design pattern. It’s like stepping away from your computer and thinking about how to improve your work; then, when you’re back at the computer, you carry out your plan to revise your work.

With reflection, the agent pauses, reviews its work so far, and brainstorms potential improvements.

This is critical because agents operate autonomously, without human intervention.

By employing reflection, an agent can be self-critical, iterating towards better answers, without you manually telling it what to do.

As Andrew Ng explains in the README, there are several benefits of an AI agent system for translation:

1. Modify the output's style, such as formal/informal.

2. Specify how to handle idioms and special terms like names, technical terms, and acronyms. For example, including a glossary in the prompt lets you make sure particular terms (such as open source, H100 or GPU) are translated consistently.

3. Specify specific regional use of the language, or specific dialects, to serve a target audience. For example, Spanish spoken in Latin America is different from Spanish spoken in Spain; French spoken in Canada is different from how it is spoken in France.

Experiment: 10 Iterations, Short Text

Playing around with the repo, I want to test stability under inverse transformation.

So, I’m going to translate from English to Spanish, back to English, then determine whether the meaning of the source text was preserved.

Here’s my code:

import translation_agent as ta

# Start in English, translating to Spanish.
source_lang, target_lang, source_text = (
    "English",
    "Spanish",
    "Large language models are pretty cool",
)

# Alternate countries to match each pass: Mexico for English->Spanish,
# US for Spanish->English.
countries = ["Mexico", "US"]

num_iters = 10

for i in range(num_iters):
    country = countries[i % len(countries)]
    t = ta.translate(source_lang, target_lang, source_text, country)
    print(
        f"Iter {i+1}/{num_iters} {source_lang}->{target_lang} [country={country}]. Output: {t}"
    )
    # Swap direction and feed the output back in as the next source text.
    source_lang, target_lang = target_lang, source_lang
    source_text = t

Note: you’ll first need to export the environment variable OPENAI_API_KEY.
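If you want a quick guard against forgetting that, something like this at the top of the script works (my addition, not part of the repo):

import os

# Fail fast with a clear message if the key isn't set.
assert "OPENAI_API_KEY" in os.environ, "Set OPENAI_API_KEY before running."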

Here’s the short text I’m translating:

Large language models are pretty cool

Here’s the full log from running 10 iterations, which translates back and forth from English-to-Spanish-to-English 5 times:

ic| num_tokens_in_text: 6
ic| 'Translating text as single chunk'
ic| num_tokens_in_text: 6
ic| 'Translating text as single chunk'
Iter 1/10 English->Spanish [country=Mexico]. Output: Los grandes modelos de lenguaje son bastante geniales.
ic| num_tokens_in_text: 12
ic| 'Translating text as single chunk'
Iter 2/10 Spanish->English [country=US]. Output: Powerful language models are really cool.
ic| num_tokens_in_text: 8
ic| 'Translating text as single chunk'
Iter 3/10 English->Spanish [country=Mexico]. Output: Los modelos de lenguaje súper potentes están padrísimos.
ic| num_tokens_in_text: 17
ic| 'Translating text as single chunk'
Iter 4/10 Spanish->English [country=US]. Output: Incredibly powerful language models are totally awesome!
ic| num_tokens_in_text: 10
ic| 'Translating text as single chunk'
Iter 5/10 English->Spanish [country=Mexico]. Output: ¡Los modelos de lenguaje increíblemente poderosos son increíbles!
ic| num_tokens_in_text: 18
ic| 'Translating text as single chunk'
Iter 6/10 Spanish->English [country=US]. Output: Language models are unbelievably powerful and just amazing!
ic| num_tokens_in_text: 10
ic| 'Translating text as single chunk'
Iter 7/10 English->Spanish [country=Mexico]. Output: ¡Los modelos de lenguaje son increíblemente poderosos y realmente asombrosos!
ic| num_tokens_in_text: 21
ic| 'Translating text as single chunk'
Iter 8/10 Spanish->English [country=US]. Output: Language models are incredibly powerful and truly amazing!
ic| num_tokens_in_text: 9
ic| 'Translating text as single chunk'
Iter 9/10 English->Spanish [country=Mexico]. Output: ¡Los modelos de lenguaje son increíblemente poderosos y realmente asombrosos!
ic| num_tokens_in_text: 21
ic| 'Translating text as single chunk'
Iter 10/10 Spanish->English [country=US]. Output: Language models are incredibly powerful and truly amazing!

I’ve cleaned it up to see only the English translations:

  1. Powerful language models are really cool.

  2. Incredibly powerful language models are totally awesome!

  3. Language models are unbelievably powerful and just amazing!

  4. Language models are incredibly powerful and truly amazing!

  5. Language models are incredibly powerful and truly amazing!

Sadly, we lost the meaning of “large language models” in the first inverse translation!

Upon reflection (see what I did there), I guess it makes sense…

Our agent doesn’t recognize “large language models” as a distinct technical term that should be translated consistently.

It’s also interesting that the translations seem to converge starting at the third inverse translation: #3, #4, and #5 are all highly similar.

The recurrence of the word “powerful” also jumps out.

My hypothesis: our agent is blending the concepts “large” and “cool”, resulting in “powerful”.

Experiment: 20 Iterations, Short Text

Next, I run the same experiment with num_iters = 20, so we get 10 translations back to English:

  1. Big AI language models are super cool!

  2. The advanced AI language models are truly incredible.

  3. It's truly incredible how advanced language models in artificial intelligence are.

  4. It's really impressive how advanced artificial intelligence language models are.

  5. It's impressive how incredibly advanced the artificial intelligence language models are.

  6. It's impressive to see how much AI language models have advanced.

  7. It's really impressive to see how much AI language models (Artificial Intelligence) have advanced.

  8. It's truly amazing to see how far artificial intelligence (AI) language models have come.

  9. It's truly amazing to see how far artificial intelligence (AI) language models have come.

  10. It's really impressive to see how far AI language models have come.

Again, we lose the meaning of “large language models” in the first pass.

The recurring word is now “advanced” instead of the previous “powerful”.

What’s really interesting is how the definition of “advanced” morphs throughout the iterations:

  • we start with “The advanced AI language models are truly incredible.”

  • then a different definition of “advanced” is used: “It's really impressive to see how much AI language models (Artificial Intelligence) have advanced.”

  • which leads to our last iteration: “It's really impressive to see how far AI language models have come.”

Due to the deviation in the meaning of “advanced”, by the time we get to the 10th translation back to English, the text has a notably different meaning from the original.

Experiment: Dictionary

I don’t want to have to manually incorporate a glossary or dictionary with the term “large language model”.

I wonder if adding a little context to the prompt will help.

I add this to the translation prompt in steps 1 and 3:

“Preserve any references to technical terms.”

But that didn’t help.

Ok fine, I’ll hardcode a dictionary:

“Preserve any references to technical terms in this dictionary:

["large language models"]”

Here are the translations back to English:

  1. The truth is that large language models are incredibly impressive.

  2. Large language models are truly impressive.

  3. Major language models are really impressive.

Yay!

We see stability under inverse transformation the first 2 times — the term “large language model” is successfully retained.

But we lose it the 3rd time!

At this point, I worry technical and industry-specific terminology remains a non-trivial obstacle for machine translation.

Experiment: 6 Iterations, Long Text

So, let’s test the agentic system on long text that does not contain jargon.

Here’s a post from one of my fave X users, which I like to read to feel warm and fuzzy:

I run my script for 6 iterations, so we get 3 translations back to English.

Here’s the 1st translation back to English:

Here’s the 3rd translation back to English:

Overall, super impressed with all 3 translations back to English!

My only 2 quibbles, both visible in the 3rd and final translation:

  • the agent changed the plural “you wake up around people you love” to the singular “you wake up in the comforting presence of your loved one”. In my opinion, this changes the original meaning of the text: the singular evokes a spouse, whereas the plural evokes multiple family members

  • in the last sentence, the agent mistranslated the word “understanding” as “control”, which in my opinion changes the original meaning of the text. There’s a difference between understanding the nature of reality vs. trying to control it.

Experiment: Russian

For fun, I switch the target language to Russian and re-run with the short text, as well as the dictionary embedded in the prompts:

“Large language models are pretty cool”
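The only change to the earlier script is the language pair (the country values here are my guess at sensible settings, since I didn’t note the exact ones I used):

# Same loop as before; only the language pair and countries change.
source_lang, target_lang = "English", "Russian"
countries = ["Russia", "US"]  # one per translation direction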

Again, the word “advanced” rears its recurring head.

Even with the dictionary embedded in the prompt, the translation loses the technical concept of “large language model” on the 4th inverse translation.

Recall, we also observed this with the Spanish translation.

Closing Thoughts

This is my first time playing with Andrew Ng’s translation-agent system.

I’m impressed by how well it translated the long passage without technical terms, for the most part retaining the meaning of each phrase. I ran it for 6 iterations, translating back to English 3 times.

Each translation was not only passable, but impressively aligned with the intent and style of the original text.

However, technical and industry-specific terms present a major challenge. In several experiments, we see the technical concept of “large language model” lost in translation. Even after adding a dictionary, we did not achieve stability under inverse transformation.

I’m excited to follow the development of this project!

From the GitHub repo’s README, here are ways to improve this AI agent system:

Have fun building!

Sabrina Ramonov

P.S. If you’re enjoying the free newsletter, it’d mean the world to me if you share it with others. My newsletter just launched, and every single referral helps. Thank you!