I know about the Google Translate API in Python. However, in my data frame only a few entries are in Hindi. How do I detect the language of these records and then translate them to English?
Basically, I want to do the following:
If a record is not Hindi, skip it; otherwise, translate it from Hindi to English.
I am using this: https://pypi.org/project/googletrans/
So, I realized that the default target language is English. The translate command automatically converts other languages to English, and text that is already in English stays as it is.
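If you would rather skip the API call for rows that are clearly not Hindi, here is a minimal sketch. It assumes Hindi text is written in Devanagari script (which Marathi and Nepali also use, so this is a script check, not true language detection). `translate_fn` is a hypothetical stand-in for whatever translation call you use; it is not part of any library.

```python
def contains_devanagari(text):
    """Return True if any character lies in the Devanagari block (U+0900 to U+097F)."""
    return any("\u0900" <= ch <= "\u097f" for ch in text)


def translate_hindi_only(values, translate_fn):
    """Translate only the entries that contain Devanagari; leave the rest untouched.

    translate_fn is a hypothetical callable taking a str and returning a str.
    """
    return [translate_fn(v) if contains_devanagari(v) else v for v in values]


if __name__ == "__main__":
    # Placeholder translator for demonstration only; swap in a real API call.
    def fake_translate(text):
        return "<translated>"

    print(translate_hindi_only(["hello", "नमस्ते"], fake_translate))
```

The same check can be applied to a data frame column with `apply`, provided the column holds only strings.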
I am currently working on a more or less small project in Python, where I am building a kind of voice assistant that interacts with some gaming APIs such as the Destiny2 API.
The big problem I am running into is the recognition of usernames (gamertags) such as Ultra_Luck_y, which the speech_recognition module for Python that I am using clearly doesn't understand. It just returns Ultra Lucky.
I also tried spelling it out, but the letters were automatically joined into words.
So my question is whether there is a solution (no matter how crappy), or whether I have to go a different way about this.
Thanks to Azamat Galimzhanov, I solved it with the radio alphabet, edited so that the words are a little shorter.
I simply put all the references in one JSON document and loaded them as a dictionary with:
import json
with open('usernames.txt') as json_file:
    alphabet = json.load(json_file)
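For anyone wondering how the decoding step might look, here is a rough sketch. The JSON layout is an assumption (a flat object mapping each spoken code word to one character), and `decode_spelled` and the sample words are made up for illustration.

```python
import json

# Assumed layout of the JSON document (hypothetical): code word -> character.
SAMPLE_JSON = (
    '{"uniform": "u", "lima": "l", "tango": "t", '
    '"romeo": "r", "alpha": "a", "underscore": "_"}'
)


def decode_spelled(transcript, alphabet):
    """Turn a recognized phrase of code words into the characters they stand for.

    The transcript is lower-cased first; words missing from the mapping are skipped.
    """
    return "".join(alphabet[w] for w in transcript.lower().split() if w in alphabet)


alphabet = json.loads(SAMPLE_JSON)
print(decode_spelled("Uniform Lima Tango Romeo Alpha underscore", alphabet))
```

Note that this loses capitalization, so extra code words would be needed for upper-case letters.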
I have one small problem; let me try to explain it.
I use the transliterate library in my Django project. The user can enter English (Latin) or Russian (Cyrillic) letters in a field. If the user writes Russian words, they are converted to Latin letters, but if the user writes English words I get the following error:
LanguageDetectionError: Can't detect language for the text "document" given.
I use this code:
transliterate.translit(field_value, reversed=True)
Also, I noticed that it seems impossible to detect English with this library, isn't it?
transliterate.detect_language(field_value) returns None when the user enters an English word.
My aim is to transliterate only if the user wrote a Russian word, and to leave the text untouched if the user wrote an English word. What can you advise?
I have now found a library that can help me detect the language: https://pypi.python.org/pypi/langdetect
Has anyone worked with this library?
Could you try detecting English and then moving on to assume Russian? I put some Russian news articles into the Python code listed below. It clearly detected that it is not English.
It's pretty simple code that can be easily applied.
isEnglish github
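A rough sketch of the "check first, transliterate only when needed" idea. Instead of detecting English, it looks for Cyrillic characters with a regular expression, which is a swapped-in technique rather than what the linked code does. `translit_fn` is a placeholder for the call from the question.

```python
import re

# Basic Cyrillic Unicode block; assumed to be sufficient for Russian input.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")


def translit_if_cyrillic(value, translit_fn):
    """Apply translit_fn only when the text contains Cyrillic letters."""
    if CYRILLIC.search(value):
        return translit_fn(value)
    return value
```

In the Django project this could be called as `translit_if_cyrillic(field_value, lambda s: transliterate.translit(s, reversed=True))`, so Latin-only input such as "document" never reaches the language detection that raises the error.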
I am new to programming, and I am trying to understand transliteration - like the Google Input Tools that will allow the user to type from one language to another language.
How does transliteration work? Specifically, if I am transliterating from English to Hindi or from English to Russian, do I need to incorporate a dictionary of words for the English, Hindi and Russian languages?
Does anyone know of any tutorials showing how to write the code for transliteration? I have tried searching, but with no luck.
Also, does the code have to be in JavaScript/jQuery (client-side code)? My project is Python/Django. Can I write the transliteration code in Python/Django?
Thanks.
Direct dictionary-to-dictionary automatic translation produces poor results because of differences in grammar and the presence of idiomatic sentences. The starting point in Python, in my experience, should be the NLTK (Natural Language Toolkit) libraries and tutorials.
For a working example, you may start from here:
Machine Translation using babelize_shell() in NLTK
Translating human languages in Python
Google is your friend
Bing is your friend
Whether to use JavaScript/jQuery depends on the UI you are planning. Maybe you want to trigger an automatic translation after a few key presses, or on blur or on change in an input tag, but that is not relevant to the translation itself.
The process of translating is also very resource-intensive, so I would discourage you from doing it inside a Django view. My suggestion is not to reinvent the wheel, and to use an existing API such as the Google or Bing ones.
I found that the better search term is Input Method Editor not transliteration.
There is a project on GitHub that deals with IMEs and transliteration: https://github.com/wikimedia/jquery.ime
I hope that this helps someone.
The typical way of implementing transliteration is to use a mapping dictionary. An example of this can be seen in the mapping.py file for the CyrTranslit Python package.
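To make the mapping-dictionary idea concrete, here is a small sketch. The table is deliberately tiny and only illustrative; it does not reproduce CyrTranslit's actual mapping or API, and the function name is my own.

```python
# Deliberately small, illustrative table; a real one covers the whole alphabet.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "и": "i", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ж": "zh", "ш": "sh", "ч": "ch",
}


def transliterate_text(text, mapping=CYRILLIC_TO_LATIN):
    """Lower-case the text, then replace each mapped character.

    Characters that are not in the mapping are kept unchanged.
    """
    return "".join(mapping.get(ch, ch) for ch in text.lower())


print(transliterate_text("привет"))
```

One source character can map to several target characters (for example "ж" to "zh"), which is why the lookup is done per character and the results are joined afterwards.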
Word translation uses a database to convert an English word into a Hindi word.
Some apps are based on this concept like:
English to Hindi Dictionary
I have been wanting to create an application using the Microsoft Speech Recognition.
My application's users are expected to often say abbreviated things, such as 'LHC' for 'Large Hadron Collider' or 'CERN'. Given that exact order, my application will return
You said: At age C.
You said: Cern
While it did work for 'CERN', it failed very badly for 'LHC'.
However, if I could make my own custom training files, I could easily place the term 'LHC' somewhere in there. Then, I could make the user access the Speech Control Panel and run my training file.
All the links I have found for this have been frustratingly useless, as they just say things like 'This is ----, you should try going to the ---- forum instead'.
If it does help, here is a list of the links:
http://compgroups.net/comp.speech.users/add-my-own-training/153194
https://groups.google.com/forum/#!topic/microsoft.public.speech.server/v58SH1ov22s
http://social.msdn.microsoft.com/Forums/en/servercorefordevelopers/thread/f7a35f3f-b352-464a-b264-e16eb4afd049
Is my problem even possible? Or are the training files themselves in a special format? If so, can that format be reproduced?
A solution that can also work on Windows XP would be ideal.
Thanks in advance!
P.S. If there are any libraries or modules out there already for this, could anyone point me to some? A Python or C/C++ solution would be splendid. Also, since I'd rather not post another question regarding this, is it possible to utilize the train utilities from command prompt (or without the GUI visible, but still having total command of all controls)?
Okay, pulling this from a thing I wrote three or four years ago now, but I believe you want to do something like this.
The grammar library is a trained system which can recognize words. You can create your own grammar library cued to specific words.
C#, sorry
using System.Speech;
using System.Speech.Recognition;
using System.Speech.AudioFormat;
SpeechRecognitionEngine sre = new SpeechRecognitionEngine();
string[] words = {"L H C", "CERN"};
Choices choices = new Choices(words);
GrammarBuilder gb = new GrammarBuilder(choices);
Grammar grammar = new Grammar(gb);
sre.LoadGrammar(grammar);
That is as far as I can get you. From the docs it looks like you can define the pronunciations somehow, so perhaps that way you could have LHC map directly to a single word. Here are the docs on the Grammar class: http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
Small update - see example in their docs here http://msdn.microsoft.com/en-us/library/ms554228.aspx
I'm looking for a way to automatically determine the natural language used by a website page, given its URL.
In Python, a function like:
def LanguageUsed(url):
    pass  # stuff
Which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, etc.)
Summary of Results:
I have a reasonable solution working in Python, using the oice.langdet package from PyPI.
It does a decent job of discriminating English from non-English, which is all I require at the moment. Note that you have to fetch the HTML yourself using Python's urllib. Also, oice.langdet is GPL-licensed.
For a more general solution using Trigrams in Python as others have suggested, see this Python Cookbook Recipe from ActiveState.
The Google Natural Language Detection API works very well (possibly the best I've seen). However, it is JavaScript, and its terms of service forbid automating its use.
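For the "fetch the HTML with urllib" step mentioned in the summary, a minimal standard-library sketch might look like this. The function and class names are my own, and it assumes the page can be decoded as UTF-8. The resulting text can then be handed to whichever detector you choose.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_page_text(url, accept_language="en-US"):
    """Download a page and return its text (assumes UTF-8; bad bytes are replaced)."""
    request = Request(url, headers={"Accept-Language": accept_language})
    with urlopen(request) as response:
        return html_to_text(response.read().decode("utf-8", errors="replace"))
```

The `accept_language` parameter is there because some sites serve different content depending on that header, so the value you send can change what you detect.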
This is usually accomplished by using character n-gram models. You can find here a state of the art language identifier for Java. If you need some help converting it to Python, just ask. Hope it helps.
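To give a feel for how a character n-gram model works, here is a toy sketch in Python. It compares trigram counts with cosine similarity, which is a simplification I chose for brevity and not the method of the linked identifier. The training sentences are far too small for real use and are only there to make the example self-contained.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(sample, profiles[lang]))


# Toy training text (an assumption for illustration only).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt über den faulen hund und der hund schläft",
}
PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}

print(identify("the dog", PROFILES))
```

A real identifier would train each profile on a large corpus per language and usually keep only the most frequent n-grams.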
Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a confidence score.
See http://code.google.com/apis/ajaxlanguage/documentation/
There is nothing about the URL itself that will indicate language.
One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like
Accept-Language: en-US
with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
You might want to try ngram based detection.
TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port provided by Thomas Mangin here, using the same corpus.
Edit: TextCat competitors page provides some interesting links too.
Edit2: I wonder if making a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...
nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the url itself don't determine the language sufficiently well for your purposes); I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it might in fact have it), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, wordset, &c, according to the rules for each language.
There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.
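A sketch of that URL heuristic, for what it is worth. Both lookup tables are short, illustrative assumptions rather than authoritative lists, and the function name is my own.

```python
from urllib.parse import urlparse

# Illustrative, incomplete tables (assumptions, not authoritative lists).
KNOWN_LANGUAGE_CODES = {"en", "es", "de", "fr", "ru", "ja"}
TLD_TO_LANGUAGE = {"de": "de", "fr": "fr", "es": "es", "ru": "ru", "jp": "ja"}


def guess_language_from_url(url, default="en"):
    """Guess a language code from path segments, then the TLD, else fall back to default."""
    parsed = urlparse(url)
    # A path segment such as /es/ between two slashes is the strongest hint.
    for segment in parsed.path.lower().split("/"):
        if segment in KNOWN_LANGUAGE_CODES:
            return segment
    # Otherwise fall back on the country-code TLD, if it is in the table.
    tld = (parsed.hostname or "").rsplit(".", 1)[-1]
    return TLD_TO_LANGUAGE.get(tld, default)
```

As the text says, this is only a hint: a .com site with no language segment simply falls through to the default.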
So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
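The common-word check can be sketched in a few lines. The word lists here are tiny and purely illustrative, and the function name is an assumption of mine.

```python
import re

# Tiny illustrative word lists; a real detector would use far larger ones.
COMMON_WORDS = {
    "en": {"a", "an", "the", "and", "of", "is"},
    "es": {"el", "la", "los", "las", "y", "de", "es"},
}


def guess_by_common_words(text):
    """Return the language whose common words occur most often, or None if none occur."""
    # Match runs of letters only (no digits or underscores), including accented letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Short words are shared between languages surprisingly often, so this works better on whole pages than on single sentences.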
In Python, the langdetect package (found here) can do this.
It is a port of Google's language-detection library and supports 55 languages by default.
It is installed by using
pip install langdetect
And then for example running
from langdetect import detect
detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")
Will return 'en' and 'de' respectively.