I'm looking for a way to automatically determine the natural language used by a website page, given its URL.
In Python, a function like:
def LanguageUsed(url):
    # stuff
which returns a language specifier (e.g. 'en' for English, 'ja' for Japanese, and so on).
Summary of Results:
I have a reasonable solution working in Python using the oice.langdet package from PyPI.
It does a decent job of discriminating English vs. non-English, which is all I need at the moment. Note that you have to fetch the HTML yourself, e.g. with Python's urllib (see the sketch after this summary). Also note that oice.langdet is GPL-licensed.
For a more general solution using trigrams in Python, as others have suggested, see this Python Cookbook recipe from ActiveState.
The Google natural language detection API works very well (it may be the best I've seen). However, it is JavaScript, and its TOS forbids automating its use.
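Putting those pieces together, here is a minimal sketch of the kind of function I was after. It uses urllib for the fetch and the langdetect package described in an answer below; oice.langdet has a different API, so treat the names here as illustrative:

import re
from urllib.request import urlopen
from langdetect import detect  # pip install langdetect

def LanguageUsed(url):
    # Fetch the page and return an ISO 639-1 code such as 'en' or 'ja'.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    # Crude tag stripping; a real implementation should use an HTML parser.
    text = re.sub(r"<[^>]+>", " ", html)
    return detect(text)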
This is usually accomplished by using character n-gram models. You can find a state-of-the-art language identifier for Java here. If you need some help converting it to Python, just ask. Hope it helps.
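For illustration, here is a minimal sketch of the character n-gram idea, using the out-of-place measure from Cavnar and Trenkle's paper (the approach tools like TextCat implement). You would need to build one reference profile per language from sample text you supply, and real implementations also normalize case, whitespace, and punctuation:

from collections import Counter

def trigram_profile(text, top=300):
    # Rank the most frequent character trigrams (a TextCat-style profile).
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; the language with the smallest distance wins.
    ranks = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)  # rank assigned to trigrams missing from the reference
    return sum(abs(i - ranks.get(g, penalty)) for i, g in enumerate(doc_profile))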
Your best bet really is to use Google's natural language detection API. It returns an ISO code for the page language, along with a probability index.
See http://code.google.com/apis/ajaxlanguage/documentation/
There is nothing about the URL itself that will indicate language.
One option would be to use a natural language toolkit to try to identify the language based on the content, but even if you can get the NLP part of it working, it'll be pretty slow. Also, it may not be reliable. Remember, most user agents pass something like
Accept-Language: en-US
with each request, and many large websites will serve different content based on that header. Smaller sites will be more reliable because they won't pay attention to the language headers.
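For illustration, here is a sketch of sending a preferred language and checking whether the server declares one for the content it serves. Many servers never set Content-Language, so treat a missing header as "unknown":

from urllib.request import Request, urlopen

url = "http://example.com/"  # illustrative
req = Request(url, headers={"Accept-Language": "en-US"})
resp = urlopen(req)
# Some servers declare the language of the variant they chose to serve:
lang = resp.headers.get("Content-Language")  # e.g. 'en-US', or None if absent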
You could also use server location (i.e. which country the server is in) as a proxy for language using GeoIP. It's obviously not perfect, but it is much better than using the TLD.
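A sketch of the GeoIP idea using the geoip2 package; the database file name is an assumption (you must download a GeoLite2 country database from MaxMind yourself):

import socket
from urllib.parse import urlparse

import geoip2.database  # pip install geoip2

def server_country(url):
    # Resolve the host to an IP, then look up the country it is hosted in.
    ip = socket.gethostbyname(urlparse(url).hostname)
    with geoip2.database.Reader("GeoLite2-Country.mmdb") as reader:  # assumed path
        return reader.country(ip).country.iso_code  # e.g. 'US'

Remember this gives you the server's country, not the page's language, so use it only as a weak signal.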
You might want to try ngram based detection.
TextCat DEMO (LGPL) seems to work pretty well (it recognizes almost 70 languages). There is a Python port by Thomas Mangin here, using the same corpus.
Edit: the TextCat competitors page provides some interesting links too.
Edit 2: I wonder whether making a Python wrapper for http://www.mnogosearch.org/guesser/ would be difficult...
NLTK might help, if you have to get down to dealing with the page's text, i.e. if the headers and the URL itself don't determine the language well enough for your purposes. I don't think NLTK directly offers a "tell me which language this text is in" function (though NLTK is large and continuously growing, so it might in fact have one), but you can try parsing the given text according to various possible natural languages and checking which ones give the most sensible parse, word set, etc., according to the rules for each language.
There's no general method that will work solely on URLs. You can check the top-level domain to get some idea, and look for portions of the URL that might be indicative of a language (like "en" or "es" between two slashes), and assume anything unknown is in English, but it isn't a perfect solution.
So far as I know, the only general way to determine the natural language used by a page is to grab the page's text and check for certain common words in each language. For example, if "a", "an", and "the" appear several times in the page, it's likely that it includes English text; "el" and "la" might suggest Spanish; and so on.
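A minimal sketch of that common-word check, using NLTK's stopword lists so you don't have to hard-code the words yourself (note it returns NLTK's corpus names, like 'english', rather than ISO codes):

from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

def guess_language(text):
    words = set(text.lower().split())
    # Pick the language whose stopword list overlaps the text the most.
    return max(
        stopwords.fileids(),
        key=lambda lang: len(words & set(stopwords.words(lang))),
    )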
In Python, the langdetect package (found here) can do this.
It is based on Google's automatic language detection and supports 55 languages by default.
It is installed with

pip install langdetect

and then running, for example,

from langdetect import detect
detect("War doesn't show who's right, just who's left.")
detect("Ein, zwei, drei, vier")

will return 'en' and 'de' respectively.
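Note that langdetect's algorithm is non-deterministic and can give different results on short or ambiguous text; the package's own documentation suggests fixing the seed for reproducible results:

from langdetect import DetectorFactory
DetectorFactory.seed = 0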
Related
I have heard about a programming language called AIML which can be used for programming Intelligent Robots.
I am a web developer and have a web crawler built using Python 2.7, and I have indexed Wikipedia ...
So I wanted to build an answering engine using Python which would use a string variable
(it is a HUGE variable containing the whole of Wikipedia) as a source of information and use AI to answer...
Finally, I wanted to put this up on my school website...
So can I do that in AIML?
Later on I also want to modify it to give live answers to questions like:
"What is the age of ~someperson~?" etc.
For that I'll send my web crawler to index some score pages, etc.
Can I program this sort of answering agent in AIML?
If yes, please provide links to tutorials that show how to do that (using string variables as a source of information to parse queries and answer like a human).
Moreover, AIML uses syntax like:
<category>
<pattern>WHAT ARE YOU</pattern>
<template>
<think><set name="topic">Me</set></think>
I am the latest result in artificial intelligence,
which can reproduce the capabilities of the human brain
with greater speed and accuracy.
</template>
</category>
where pattern is the query and template is the answer. So does that mean I have to sit and write these tags for all possible queries?
Or can I make it use its brains to figure out what the person wants and give answers
using the string variable as its source of information?
Thank you.
AIML
It looks like AIML is a form of pattern matching, mainly meant for dialog-based agents. Therefore, to use AIML, you would likely need to manually write out every question pattern and its correct response (answer).
Question answering
What it seems you are really after is what we call a question answering (QA) system. Very briefly, a QA system generally has these components:
Question analysis.
    Extract keywords.
    (Sometimes) determine the expected answer type (location, person, color, number, etc.).
Candidate document selection: doing a search on your knowledge base using an information retrieval system.
Candidate document analysis.
    Answer extraction: select some part of the document (sentence(s) or paragraph(s)).
Response generation.
    Score and rank each answer.
    Display the most confident answer(s).
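To make those components concrete, here is a toy sketch of the whole pipeline using nothing but keyword overlap; a real system would use a proper information retrieval engine and answer-type checking:

import re
from collections import Counter

STOPWORDS = {"what", "is", "the", "of", "a", "an", "who", "was", "in", "how"}

def keywords(question):
    # Question analysis: keep the non-stopword terms.
    return [w for w in re.findall(r"\w+", question.lower()) if w not in STOPWORDS]

def best_sentence(question, documents):
    # Candidate selection and answer extraction by keyword overlap.
    terms = Counter(keywords(question))
    best, best_score = None, 0
    for doc in documents:
        for sent in re.split(r"(?<=[.!?])\s+", doc):
            score = sum(terms[w] for w in re.findall(r"\w+", sent.lower()))
            if score > best_score:
                best, best_score = sent, score
    return best  # the most confident "answer", or None if nothing overlaps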
Research
If you really want to dig deeply into this area, I'd suggest using Google Scholar to search for some of the terms I've mentioned, which will turn up research papers that go into detail about many of these topics. Some papers to get you started:
Natural language question answering: The view from here
Answering complex, list and context questions with LCC's Question-Answering Server
The structure and performance of an open-domain question answering system
Learning surface text patterns for a question answering system
Learning question classifiers
What is not in the Bag of Words for Why-QA?
Shameless plug
I've recently taken a course on natural language processing, and developed a rudimentary QA system that uses Wikipedia as a knowledge base. (Actually, I used the Simple English Wikipedia because it was much easier to work with; though the system does work with the full version just much more slowly.)
If you are interested in looking at some Python code as a reference, you may do so on the project's GitHub page: bwbaugh/causeofwhy. In addition, there is some more detailed documentation on what goes on in each step of the system components.
There is also a very basic working demo of the QA system that is (currently) available; however, bear in mind that the system is a proof of concept and can take upwards of 30 seconds to respond to a question (depending on the question).
I am new to programming, and I am trying to understand transliteration, like the Google Input Tools that let users type in one script and produce text in another.
How does transliteration work? Specifically, if I am going from English to Hindi or English to Russian, do I need to incorporate a dictionary of words for the English, Hindi, and Russian languages?
Does any one know of any tutorials showing how to write the code for transliteration? I have tried searching, but no luck.
Also, does the code have to be in JavaScript/jQuery (client-side code)? My project is Python/Django. Can I write the transliteration code in Python/Django?
Thanks.
Direct dictionary-to-dictionary automatic translation produces poor results because of differences in grammar and the presence of idiomatic sentences. The starting point in Python, in my experience, should be the NLTK (Natural Language Toolkit) libraries and tutorials.
Then, for a working example, you may start from here:
Machine Translation using babelize_shell() in NLTK
Translating human languages in Python
Google is your friend
Bing is your friend
The use of JavaScript/jQuery depends on the UI you are planning; maybe you want to trigger an automatic translation after a few keys are pressed, or on blur or change of an input tag, but that is not relevant to the translation itself.
The process of translating is also really resource-intensive, so I discourage you from doing it inside a Django view. My suggestion is not to reinvent the wheel, but to use an existing API like Google's or Bing's.
I found that the better search term is "input method editor" (IME), not "transliteration".
There is a project on GitHub, https://github.com/wikimedia/jquery.ime, that deals with IMEs and transliteration.
I hope that this helps some one.
The typical way of implementing transliteration is to use a mapping dictionary. An example of this can be seen in the mapping.py file for the CyrTranslit Python package.
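A minimal sketch of that approach; the mapping below is a tiny illustrative fragment of a Latin-to-Cyrillic table, not a complete scheme:

# Tiny illustrative fragment; a real scheme has a much larger table.
MAPPING = {"shch": "щ", "ch": "ч", "sh": "ш", "zh": "ж", "a": "а", "b": "б", "v": "в"}

def transliterate(text):
    out, i = [], 0
    while i < len(text):
        # Greedy longest-match against the mapping table.
        for length in (4, 3, 2, 1):
            chunk = text[i:i + length]
            if chunk in MAPPING:
                out.append(MAPPING[chunk])
                i += length
                break
        else:
            out.append(text[i])  # pass unmapped characters through
            i += 1
    return "".join(out)

The longest-match rule matters: 'shch' must be tried before 'sh', or the output will be wrong.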
Word translation uses a database to convert an English word into a Hindi word.
Some apps are based on this concept, for example:
English to Hindi Dictionary
I am planning to develop a web-based application which could crawl Wikipedia to find relations and store them in a database. By relations, I mean searching for a name, say 'Bill Gates', finding his page, downloading it, and pulling out the various pieces of information from the page to store in a database. The information might include his date of birth, his company, and a few other things. But I need to know whether there is any way to find these specific data items on the page, so that I can store them in a database. Any specific books or algorithms would be greatly appreciated. Mentions of good open-source libraries would also be helpful.
Thank you.
If you haven't already, you should have a look at DBpedia. Many categories of wiki articles have "Infoboxes" for the kinds of information you describe, and they've made a database out of it:
http://en.wikipedia.org/wiki/DBpedia
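For example, here is a minimal sketch of pulling a birth date out of DBpedia's public SPARQL endpoint with the SPARQLWrapper package:

from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?birthDate WHERE { dbr:Bill_Gates dbo:birthDate ?birthDate . }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["birthDate"]["value"])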
You might also leverage some of the information in Metaweb's Freebase (which overlaps, and I believe may even integrate, the info from DBpedia). They have an API for querying their graph database, and there's a Python wrapper for it called freebase-python.
UPDATE: Freebase is no more; it was acquired by Google and eventually folded into the Google Knowledge Graph. There is an API, but I don't think they have anything like the formal syncing Freebase had with public sources like Wikipedia. I'm personally disappointed in how this looks to have turned out. :-/
As for the natural language processing bit, if you do make headway on that problem you might consider these databases as repositories for any information you do mine.
You mention Python and open source, so I would investigate NLTK (the Natural Language Toolkit). Text mining and natural language processing are areas where you can do a lot with a dumb algorithm (e.g. pattern matching), but if you want to go a step further and do something more sophisticated, i.e. try to extract information that is stored in a flexible manner or find information that might be interesting but is not known a priori, then natural language processing is worth investigating.
NLTK is intended for teaching, so it is a toolkit. This approach suits Python very well. There are a couple of books for it as well; the O'Reilly book is also published online under an open license. See NLTK.org.
Jvc, there are existing Python modules that can do everything you mentioned above.
For pulling information from webpages, I like to use Selenium, http://seleniumhq.org/projects/ide/. Basically, you can locate and retrieve information on any webpage using a number of identifiers (id, XPath, etc.).
However, like winwaed said, it can be inflexible if you are simply "pattern matching", especially since some websites use dynamic code, meaning the identifiers can change with each subsequent reload of the page. But this problem can be solved by adding regular expressions, i.e. (.*), to your code. Check out this YouTube video, http://www.youtube.com/watch?v=Ap_DlSrT-iE. Even though he is using BeautifulSoup to scrape the website, you can see how he uses regular expressions to pull the information from the page.
Also, I'm not sure what type of database you are working with, but pyodbc, http://code.google.com/p/pyodbc/, can work with SQL types, and also mainstream databases like Microsoft Access.
So, my advice is to look into Selenium for finding the info on the webpage, pyodbc to store and retrieve it, and regular expressions when the identifiers are dynamic.
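To illustrate the whole loop, here is a sketch that fetches a page, pulls a date out with a regular expression, and stores it. The pattern is purely illustrative (real infobox markup is messier), and I've used the standard library's sqlite3 in place of pyodbc to keep the example self-contained:

import re
import sqlite3
from urllib.request import urlopen

html = urlopen("http://en.wikipedia.org/wiki/Bill_Gates").read().decode("utf-8", "replace")
# Illustrative pattern only; real pages need a tolerant parser or a better regex.
match = re.search(r"Born.*?(\d{4}-\d{2}-\d{2})", html, re.DOTALL)

conn = sqlite3.connect("people.db")
conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, born TEXT)")
if match:
    conn.execute("INSERT INTO people VALUES (?, ?)", ("Bill Gates", match.group(1)))
conn.commit()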
I am trying to work out a solution for detecting traceability between source code and documentation. The most important use case is that the user needs to see a collection of source code tokens (sorted by relevance to the documentation) that can be traced back to the documentation. She won't be bothered about the code format, but somehow needs to see an "identifier-to-documentation" mapping to get the idea of traceability.
I take the tokens from source code files and split the concatenated identifiers (SimpleMAXAnalyzer becomes "simple max analyzer"), which then act as search terms on the documentation. Search frameworks seem best suited to this specific task: drilling down into documents to locate things using powerful information retrieval algorithms. Whoosh looked like a really great Python search library, with a number of analyzers and filters.
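For what it's worth, the splitting step looks roughly like this; the regular expression handles CamelCase runs such as SimpleMAXAnalyzer:

import re

def split_identifier(name):
    # SimpleMAXAnalyzer -> "simple max analyzer"
    parts = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", name)
    return " ".join(parts).lower()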
Though the problem is similar to search, it differs in that the user is not actually performing any search. So am I solving the problem the right way? Given that everything is static and needs to be computed only once, am I using the wrong tool (a search framework) for the job?
I'm not sure I understand your use case. The user sees the source code and has some way of jumping from a token to the appropriate part (or a listing of the possible parts) of the documentation, right?
Then a search tool seems to be the right tool for the job, although you could precompile every possible search (there is only a limited number of identifiers in the source, so you can calculate all possible references to the docs in advance).
Or are there any "canonical" parts of the documentation for every identifier? Then maybe some kind of index would be a better choice.
Maybe you could clarify your use case a bit further.
Edit: Maybe an alphabetical index of the documentation could be a step to the solution. Then you can look up the pages/chapters/sections for every token of the source, where all or most of its components are mentioned.
I'm exploring many technologies, but I would like your input on which web framework would make this the easiest and most feasible. I'm currently looking at JSP/JSF/PrimeFaces, but I'm not sure if that stack is capable of supporting this app.
Here's a basic description of the app:
Users log in with their username and password (maybe I can somehow incorporate OpenID?).
With a really nice UI, they will be presented a large list of questions specific to a certain category, for example, "Cooking". (I will manually compile this list and make it available.)
When they click on any of these questions, a little input box opens up below it to allow the user to put in a link/URL.
If the webpage the URL points to contains the same question, they are awarded one point. The question then disappears and is added to a different page that lists all correctly linked questions.
On the right side of the screen, there will be a leaderboard with the usernames of the people with the top ten points.
The idea is relatively simple - to be able to compile links to external websites for specific questions by allowing many people to contribute.
I know I can build the UI easily with PrimeFaces. What I'm not sure about is whether JSP/JSF gives the ability to parse the HTML at a certain URL to see if it contains certain words. I can do this easily in Python using urllib, but I can't use Python for building the web GUI (it is very difficult). What is the best approach?
Any help would be appreciated!!! Thanks!
The best approach is whatever is best for you. If Python isn't your strength but Java is, then use Java. If you're a Python expert and know little Java, use Python.
There are so many resources on the Internet supporting so many platforms that the decision really comes down to what works best for you.
For starters, forget about JSP/JSF. This is an old combination that had many problems. Please consider Facelets/JSF. Facelets is the default templating language in the current version of JSF, while JSP is there only for backwards compatibility.
What I'm not sure about is whether JSP/JSF gives the ability to parse the HTML at a certain URL to see if it contains certain words.
Yes it does, although the actual fetching of data and parsing of its content will be done by plain Java code. This itself has nothing to do with the JSF APIs.
With JSF you create a Facelet containing your UI (input fields, buttons, etc.). Then, still using JSF, you bind this to a so-called backing bean, which is primarily a normal Java class with only one or two JSF-specific annotations applied to it (e.g. @ManagedBean).
When the user enters the URL and presses some button, JSF takes care of calling some action method in your Java class (backing bean). In this action method you now have access to the URL the user entered, and from here on plain Java coding starts and JSF specifics end. You can put the code that fetches the URL and does the parsing you require in a separate helper class (separation of concerns), or at your discretion directly in the backing bean. The choice is yours.
Incidentally, we had a very junior programmer at our office use JSF for something not unlike what you are asking for here, and he succeeded in doing it in a short time. It really isn't that hard ;)
No web technology does what you want out of the box. Parsing documents found at certain URLs is outside the scope of building web interfaces.
However, each of Java's web technologies will give you, without limits, access to a rich and varied (if not too rich and much too varied) set of libraries and frameworks running on JVM. You could safely say that if there is a library for doing something, there will be a Java version available. Downloading and parsing a document will not require more than what is available in the standard library (unless you insist on injecting your dependencies or crosscutting your concerns), so no problems with doing your project with JSP, or JSF/Primefaces, or whatever.
Since you claim to already know Python, and since you will have to write some HTML/CSS anyway, I suggest you try Django. It's dead simple, has a set of OpenID plugins to choose from, and will give you an admin interface for free (so you can prime the pump with the first set of links).
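As for the HTML check itself, that part is small in any language. A minimal Python sketch (the function name and approach are illustrative, not a fixed API):

from urllib.request import urlopen

def page_contains(url, phrase):
    # Fetch the page and check, case-insensitively, whether the phrase appears.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    return phrase.lower() in html.lower()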