I am new to programming, and I am trying to understand transliteration - like the Google Input Tools that will allow the user to type from one language to another language.
How does transliteration work? Specifically, if I am translating from English to Hindi or English to Russsian, do I need to incorporate a dictionary of words for English, Hindi and Russian languages?
Does any one know of any tutorials showing how to write the code for transliteration? I have tried searching, but no luck.
Also, does the code have to be in JavaScript/JQuery (client side code)? My project is Python/django. Can I write the transliteration code in python/dgango?
Thanks.
Direct dictionary-to-dictionary automatic translation produces poor results due to differences in grammar and the presence of idiomatic sentences. The starting point in python, in my experience, should be NLTK (Natural Language ToolKit) libraries and tutorials.
Then, trying to provide you a working example you may start from here:
Machine Translation using babelize_shell() in NLTK
Translating human languages in Python
Google is your friend
Bing is your friend
The use of javascript/jquery depends on the UI you are planning, maybe you want to trigger an automatic translation after a few key pressed, or onblur or onchange in a input tag but is not relevant for the translation itself.
The process of translating is also really resource consuming, so I discourage you to do it inside a django view. My suggestion is to not reinvent the wheel, and use some already existing API like google or bing ones.
I found that the better search term is Input Method Editor not transliteration.
There is a project on github here: https://github.com/wikimedia/jquery.ime that deals with IME's and transliteration here.
I hope that this helps some one.
The typical way of implementing transliteration is to use a mapping dictionary. An example of this can be seen in the mapping.py file for the CyrTranslit Python package.
Word translation usages a database to convert English word into Hindi Word.
Some apps are based on this concept like:
English to Hindi Dictionary
Related
I know about the Google Translate API in python. However, in my data frame, there are only a few entries that have 'Hindi' language. How do I recognize the language of these records and then translate them to English.
Basically, I want to do the following.
if !hindi, continue else translate from hindi to english.
I am using this - https://pypi.org/project/googletrans/
So, I realized that default is English. The translate command automatically converts other languages to English and if it is English, it stays as it is.
I am working on a project based something on natural language understanding.
So, what I am currently doing is to try and reference the pronouns to their respective antecedents, for which I am trying to build a model. I have worked out the basic part of it, but to complete the task, I need to understand the narrative of the sentence. So what I want is to check whether the noun and object are associated with each other by the verb using an API in python.
Example:
method(laptop, have, operating-system) = yes
method(program, have, operating-system) = No
method("he"/"proper_noun", play, football) = yes
method("he"/"proper_noun", play, college) = No
I've heard about nltk's wordnet API, but I am not sure whether I can use it to perform the same. Can it be used?
Also, I am kind of on a clock.
Any suggestions are welcome and appreciated.
Notes: I am using parsey-mcparseface to break the sentence. I could do the same with nltk but P-MPF is more accurate.
** Why isn't there an NLU tag available? **
Edit 1:
Thanks to alexis, The thing I am trying to do is called "Anaphora Resolution".
The name for what you want is "anaphora resolution", or "coreference resolution". It's a hard problem (probably harder than you realize-- nlp tasks are like that), so unless your purpose is just to learn, I recommend you try some existing solutions. I don't know of an anaphora resolution module in the nltk itself, but you can find it as part of the Stanford CoreNLP suite.
See this question about how to interface to it from the nltk. (I haven't tried it myself).
I have heard about a programming language called AIML which can be used for programming Intelligent Robots.
I am a web developer and have a web crawler build using Python 2.7 and have indexed Wikipedia ...
So I wanted to build a answering engine using python which would use a string variable
(It is a HUGE variable containing the whole of Wikipedia) as a source of information and use AI to answer...
Finally, I wanted to put this up on my school website...
So can I do that in AIML?
Later on I also want to modify it so as to give my live scores answers to questions like:
"What is the age of ~someperson~?" etc.
For that I'll send my web crawler to index some score pages etc..
Can I program this sort of answering agent in AIML?
If yes please provide links to tutorials which tell me how to do that? (using string variables as a source of information to parse queries and answer like a human)
moreover, AIML uses syntax like:
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
<think><set name="topic">Me</set></think>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
Where pattern is the query and template is answer, so does that mean I have to sit and write these tags for all possible queries?
Or can I make it use its brains to figure out what the person wants and give them answers
using the string variable as its source of information.
Thank you.
AIML
It looks like AIML is a form of pattern matching. Moreover, it looks like this is mainly meant for dialog based agents. Therefore, to use AIML, you would likely need to manually generate every question and the correct response (answer).
Question answering
What it seems like you are really after is what we call a question answering system. Very briefly, a QA system generally has these components:
Question analysis.
Extract keywords.
(Sometimes) determine expected answer type (location, person, color, number, etc.).
Candidate document selection---doing a search on your knowledge base using an information retrieval system.
Candidate document analysis.
Answer extraction---select some part of the document (sentence(s), paragraph(s)).
Response generation.
Scores and ranks each answer.
Displays the most confident answer(s).
Research
If you're really want to dig deeply into this area, I'd suggest using Google Scholar and search for some of the terms I've mentioned, which will give you some research papers that go into detail about many of these topics. Some papers to get you started:
Natural language question answering: The view from here
Answering complex, list and context questions with LCC's Question-Answering Server
The structure and performance of an open-domain question answering system
Learning surface text patterns for a question answering system
Learning question classifiers
What is not in the Bag of Words for Why-QA?
Shameless plug
I've recently taken a course on natural language processing, and developed a rudimentary QA system that uses Wikipedia as a knowledge base. (Actually, I used the Simple English Wikipedia because it was much easier to work with; though the system does work with the full version just much more slowly.)
If you are interested in looking at some Python code as a reference, you may do so on the project's GitHub page: bwbaugh/causeofwhy. In addition, there is some more detailed documentation on what goes on in each step of the system components.
There is also a very basic working demo of the QA system in action that is (currently) available, however bear in mind the system is a proof-of-concept and can take upwards of 30 seconds to respond to a question (depending on the question).
I have been wanting to create an application using the Microsoft Speech Recognition.
My application's users are expected to often say abbreviated things, such as 'LHC' for 'Large Hadron Collider' or 'CERN'. Given that exact order, my application will return
You said: At age C.
You said: Cern
While it did work for 'CERN', it failed very badly for 'LHC'.
However, if I could make my own custom training files, I could easily place the term 'LHC' somewhere in there. Then, I could make the user access the Speech Control Panel and run my training file.
All the links I have found for this have been frustratingly useless, as they just say things like 'This is ----, you should try going to the ---- forum instead'.
If it does help, here is a list of the links:
http://compgroups.net/comp.speech.users/add-my-own-training/153194
https://groups.google.com/forum/#!topic/microsoft.public.speech.server/v58SH1ov22s
http://social.msdn.microsoft.com/Forums/en/servercorefordevelopers/thread/f7a35f3f-b352-464a-b264-e16eb4afd049
Is my problem even possible? Or are the training files themselves in a special format? If so, can that format be reproduced?
A solution that can also work on Windows XP would be ideal.
Thanks in advance!
P.S. If there are any libraries or modules out there already for this, could anyone point me to some? A Python or C/C++ solution would be splendid. Also, since I'd rather not post another question regarding this, is it possible to utilize the train utilities from command prompt (or without the GUI visible, but still having total command of all controls)?
Okay, pulling this from a thing I wrote three or four years ago now, but I believe you want to do something like this.
The grammar library is a trained system which can recognize words. You can create your own grammar library cued to specific words.
C#, sorry
using System.Speech
using System.Speech.Recognition
using System.Speech.AudioFormat
SpeechRecognitionEngine sre = new SpeechRecognitionEngine();
string[] words = {"L H C", "CERN"};
Choices choices = new Choices(words);
GrammarBuilder gb = new GrammarBuilder(choices);
Grammar grammar = new Grammar(gb);
sre.LoadGrammar(grammar);
That is as far as I can get you. From docs it looks like you can define the pronunciations somehow. So perhaps that way you could have LHC map directly to a single word. Here are the docs on the grammar class - http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
Small update - see example in their docs here http://msdn.microsoft.com/en-us/library/ms554228.aspx
I'm looking for a way to automatically determine the natural language used by a website page, given its URL.
In Python, a function like:
def LanguageUsed (url):
#stuff
Which returns a language specifier (e.g. 'en' for English, 'jp' for Japanese, etc...)
Summary of Results:
I have a reasonable solution working in Python using code from the PyPi for oice.langdet.
It does a decent job in discriminating English vs. Non-English, which is all I require at the moment. Note that you have to fetch the html using Python urllib. Also, oice.langdet is GPL license.
For a more general solution using Trigrams in Python as others have suggested, see this Python Cookbook Recipe from ActiveState.
The Google Natural Language Detection API works very well (if not the best I've seen). However, it is Javascript and their TOS forbids automating its use.
This is usually accomplished by using character n-gram models. You can find here a state of the art language identifier for Java. If you need some help converting it to Python, just ask. Hope it helps.
Your best bet really is to use Google's natural language detection api. It returns an iso code for the page language, with a probability index.
See http://code.google.com/apis/ajaxlanguage/documentation/
There is nothing about the URL itself that will indicate language.
One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like
Accept-Language: en-US
with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
You might want to try ngram based detection.
TextCat DEMO (LGPL) seems to work pretty well (recognizes almost 70 languages). There is a python port provided by Thomas Mangin here using the same corpus.
Edit: TextCat competitors page provides some interesting links too.
Edit2: I wonder if making a python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...
nltk might help (if you have to get down to dealing with the page's text, i.e. if the headers and the url itself don't determine the language sufficiently well for your purposes); I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it might in fact have it), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, wordset, &c, according to the rules for each language.
There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.
So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
In Python, the langdetect package (found here) can do this.
It is based on Googles automatic language detection and supports by default 55 languages.
It is installed by using
pip install langdetect
And then for example running
from langdetect import detect
detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")
Will return 'en' and 'de' respectively.