Slightly related to my previous post, one of my tasks today was to translate a list of keywords into different languages. The input I received looked something like the list below. There were many more entries, but I truncated most of them to keep the example simple:
- Institutions: disputes, grievance, court, penal, judicial, prosecutorial, punishment, lawyer, attorney
- Infrastructure: railways, traffic, trucks, cars, vehicles, road, taxi, public transportation, port, logistics, collision, passenger
[...]
Reshape the text into a processible format
First, I needed to transform this text into a more workable format. A good text editor certainly helps: I was able to quickly convert the text into CSV format using Emacs’ macro functionality and Vi’s keybindings.
The functionality provided by good text editors really shines when small, spontaneous text manipulations like this are involved. Sure, when text inputs are big and repetitive, you should write a script to handle them, but for me, these small-to-average operations on text are the bulk of my work.
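For the cases where a script does make more sense than editor macros, the reshaping step could be sketched in Python roughly as follows. This is only an illustration, not what I actually ran: the sample input is a shortened copy of the list above, the helper names `parse_keywords` and `to_csv` are made up, and the two-column layout with a `Category` column is an assumption (only the `Keyword` column is used later on).

```python
import csv
import io

# Shortened sample of the input list shown earlier
raw = """\
- Institutions: disputes, grievance, court
- Infrastructure: railways, traffic, public transportation
"""


def parse_keywords(text):
    """Yield (category, keyword) pairs from lines like '- Category: a, b, c'."""
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- ") or ":" not in line:
            continue  # skip blank lines and anything that is not a list item
        category, _, words = line[2:].partition(":")
        for word in words.split(","):
            word = word.strip()
            if word:
                yield category.strip(), word


def to_csv(text):
    """Return the parsed keywords as CSV text, one keyword per row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Category", "Keyword"])  # assumed column names
    writer.writerows(parse_keywords(text))
    return buf.getvalue()


print(to_csv(raw))
```

Writing the returned string to a file such as keywords-en.csv would give one row per keyword, which is a convenient shape for applying a function to every keyword in a column.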
Using googletrans to automate translations
To automate the translations, a quick search pointed me to the module googletrans (PyPI repo). Here is the disclaimer from the package’s author:
DISCLAIMER: this is an unofficial library using the web API of translate.google.com and also is not associated with Google.
Thus, the package doesn’t work perfectly. On my first few attempts, the API refused to translate many words and returned them untranslated. I ended up writing a custom wrapper function around googletrans.Translator.translate() that re-requests the translation if it fails.
The final script looks something like this:
#!/usr/bin/env python3
import time

import pandas as pd
from googletrans import Translator

# Input CSV; it must contain a "Keyword" column with the English keywords
kwd = pd.read_csv("keywords-en.csv")

# Target language codes
langs = ["ar", "id", "ml", "bg", "ca", "zh-CN", "zh-TW", "cs", "da",
         "nl", "fi", "fr", "de", "hu", "it", "ja", "ko", "no", "pl",
         "pt", "ru", "sk", "es", "sv", "th", "tr", "vi"]


def ggtrans(text, src, dest):
    """Translate `text` from `src` to `dest` language, retrying on failure"""
    while True:
        try:
            trans = Translator()
            result = trans.translate(text, src=src, dest=dest).text
            print("Translated: {:>20} -> {:>20}".format(text, result))
            return result
        except Exception as e:
            print("Error: {} from {} to {}".format(text, src, dest))
            print(e)
            time.sleep(5)  # wait before sending the request again


for lang in langs:
    # Translate every keyword, then translate it back to English for comparison
    kwd[lang] = kwd['Keyword'].apply(ggtrans, src="en", dest=lang)
    kwd[lang + "-retrans"] = kwd[lang].apply(ggtrans, src=lang, dest="en")

print("Writing to CSV")
kwd.to_csv("translations.csv", index=False)
Admittedly, the script may take a long time to finish, because it sends one (or several) new requests to the web API of translate.google.com for each and every translation it performs. But only then can I make sure that all of my words are properly translated.
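If the running time or the endless retry loop becomes a problem, one possible variation is to cap the number of attempts, wait longer after each failure, and remember translations that were already fetched so that repeated words are not requested twice. The sketch below is not part of my script: `retry` and `make_cached_translator` are hypothetical helpers, the default values are arbitrary, and `translate` stands for any function taking `(text, src, dest)`, for example a thin wrapper around the googletrans call used above.

```python
import time
from functools import lru_cache


def retry(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()` until it succeeds or `max_attempts` is reached.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print("Attempt {} failed ({}); retrying in {}s".format(
                attempt, exc, delay))
            sleep(delay)


def make_cached_translator(translate, **retry_kwargs):
    """Wrap `translate(text, src, dest)` with retries and an in-memory cache."""

    @lru_cache(maxsize=None)
    def cached(text, src, dest):
        # Only successful results are cached; failures are retried next time
        return retry(lambda: translate(text, src, dest), **retry_kwargs)

    return cached
```

The trade-off is that a word which keeps failing now raises an error after the last attempt instead of blocking the script forever, so the caller has to decide what to do with it.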