Given a word (English or non-English), how can I construct a list of words (English or non-English) with similar spelling?
For example, given the word 'sira', some similar words are:
sirra
seira
siara
saira
shira
I'd prefer this to be on the verbose side, meaning it should generate as many words as possible.
Preferably in Python, but code in any language is helpful.
The Australian Business Register ABN lookup tool (a tool that finds business registration numbers based on search keywords) does a good job of this.
Thanks
What you are looking for is provided by the ispell family of dictionaries. There is a relatively easy interface to them via the hunspell library.
You can download the actual data (dictionaries) from here, among other places (such as the OpenOffice plugin pages).
There is an interface to get a number of similar words based on the edit distance, as suggested in the comment. Going with the example from GitHub:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spook', 'cookie', 'bookie', 'Spokane', 'spoken']
For searching in databases, use LIKE.
The query you'd want is:
SELECT * FROM `testTable` WHERE name LIKE '%s%i%r%a%';
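As a minimal sketch of that idea (using sqlite3 purely for illustration; the table name and sample rows are assumptions), the pattern just interleaves % wildcards between the letters of the word:
import sqlite3

def like_pattern(word):
    """Interleave '%' wildcards between the letters: 'sira' -> '%s%i%r%a%'."""
    return '%' + '%'.join(word) + '%'

# illustrative in-memory table; in practice this would be your own table/column
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE testTable (name TEXT)")
conn.executemany("INSERT INTO testTable VALUES (?)",
                 [('sirra',), ('shira',), ('cobra',)])

rows = conn.execute("SELECT name FROM testTable WHERE name LIKE ?",
                    (like_pattern('sira'),)).fetchall()
print(rows)  # [('sirra',), ('shira',)]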
Related
I have to identify cities in a document (it contains only plain text). I do not want to maintain an entire vocabulary, as that is not a practical solution, and I do not have an Azure Text Analytics API account.
I have already tried spaCy: I ran NER, identified geolocations, and passed that output to spellchecker() to train the model. But the issue with this is that NER requires sentences, while my input consists of single words.
I am relatively new to this field.
You can check out the geotext library.
Working example with a sentence:
text = "The capital of Belarus is Minsk. Minsk is not so far away from Kiev or Moscow. Russians and Belarussians are nice people."
from geotext import GeoText
places = GeoText(text)
print(places.cities)
Output:
['Minsk', 'Minsk', 'Kiev', 'Moscow']
Working example with a list of words:
wordList = ['London', 'cricket', 'biryani', 'Vilnius', 'Delhi']
for word in wordList:
    places = GeoText(word)
    if places.cities:
        print(places.cities)
Output:
['London']
['Vilnius']
['Delhi']
geograpy is another alternative. However, I find geotext lighter, since it has fewer external dependencies.
There is a list of libraries that may help you,
but from my experience there is no perfect library for this. If you know all the cities that may appear in the text, then a vocabulary is the best approach.
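A minimal sketch of that vocabulary approach (the city set here is purely illustrative; substitute your own list):
KNOWN_CITIES = {'london', 'vilnius', 'delhi'}  # illustrative vocabulary

def find_cities(words):
    # keep only the words that appear in the known-city set (case-insensitive)
    return [w for w in words if w.lower() in KNOWN_CITIES]

print(find_cities(['London', 'cricket', 'biryani', 'Vilnius', 'Delhi']))
# ['London', 'Vilnius', 'Delhi']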
I'm looking for a way to make a function in Python where you pass in a string and it returns whether that string is spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other proper nouns will count as being spelled correctly.
Here's what I have so far. It works most of the time, but it fails on some celebrity names. For example, names like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom
data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""
def spellCheck(word_to_spell):
    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
    response = con.getresponse()
    dom = xml.dom.minidom.parseString(response.read())
    dom_data = dom.getElementsByTagName('spellresult')[0]
    if dom_data.childNodes:
        for child_node in dom_data.childNodes:
            result = child_node.firstChild.data.split()
            for word in result:
                if word_to_spell.upper() == word.upper():
                    return True
        return False
    else:
        return True
Peter Norvig explains how to implement a spell checker in Python.
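As a minimal sketch in the spirit of that approach (not Norvig's full corrector), you can generate every string one edit away from a word and then intersect the result with whatever word list you trust:
def edits1(word, letters='abcdefghijklmnopqrstuvwxyz'):
    # all strings one edit away: deletes, transposes, replaces, inserts
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

print(len(edits1('sira')))  # a few hundred candidates to filter against a word list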
Rather than sticking with Google, try out the other big players.
If you really want to stick with search engines that count page requests, Yahoo and Bing provide some excellent features. Yahoo directly provides spell-checking services via YQL tables (free: 5000 requests/day, non-commercial).
There are also a good number of Python APIs capable of similar magic, including on the proper nouns you mentioned (results can sometimes go wrong; after all, it is ultimately based on probability).
So, in the second case, you have a good list of options (totally free):
GNU Aspell (it even has Python bindings)
PyEnchant
Whoosh (It does a lot more than spell checking but I think it has some edge on it.)
I hope these give you a clear idea of how things work.
Spell checking actually involves very complex mechanisms from machine learning, AI, NLP, and more, which is why companies like Google and Yahoo don't offer their APIs entirely for free.
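For instance, a minimal sketch with PyEnchant (assuming the en_US dictionary is installed; the sample word is illustrative):
import enchant  # PyEnchant

d = enchant.Dict("en_US")
word = "celebrety"
if not d.check(word):       # check() returns True if the word is in the dictionary
    print(d.suggest(word))  # close dictionary words, e.g. ['celebrity', ...]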
Is there any Python library to get a list of countries for a specific language code, where that language is an official or commonly used language?
For example, language code of "fr" is associated with 29 countries where French is an official language plus 8 countries where it's commonly used.
Despite the accepted answer, as far as I can tell none of the xml files underlying pycountry contains a way to map languages to countries. It contains lists of languages and their iso codes, and lists of countries and their iso codes, plus other useful stuff, but not that.
Similarly, the Babel package is great but after digging around for a while I couldn't find any way to list all languages for a particular country. The best you can do is the 'most likely' language: https://stackoverflow.com/a/22199367/202168
So I had to get it myself...
import lxml.etree
import urllib.request

def get_territory_languages():
    url = "https://raw.githubusercontent.com/unicode-org/cldr/master/common/supplemental/supplementalData.xml"
    langxml = urllib.request.urlopen(url)
    langtree = lxml.etree.XML(langxml.read())
    territory_languages = {}
    for t in langtree.find('territoryInfo').findall('territory'):
        langs = {}
        for l in t.findall('languagePopulation'):
            langs[l.get('type')] = {
                'percent': float(l.get('populationPercent')),
                'official': bool(l.get('officialStatus')),
            }
        territory_languages[t.get('type')] = langs
    return territory_languages
You probably want to store the result of this in a file rather than calling across the web every time you need it.
This dataset contains 'unofficial' languages as well, which you may not want to include; here's some more example code:
TERRITORY_LANGUAGES = get_territory_languages()

def get_official_locale_ids(country_code):
    country_code = country_code.upper()
    # most widely-spoken first:
    langs = sorted(
        TERRITORY_LANGUAGES[country_code].items(),
        key=lambda l: l[1]['percent'],
        reverse=True,
    )
    return [
        '{lang}_{terr}'.format(lang=lang, terr=country_code)
        for lang, spec in langs if spec['official']
    ]

>>> get_official_locale_ids('es')
['es_ES', 'ca_ES', 'gl_ES', 'eu_ES', 'ast_ES']
Look at the Babel package. It has a pickle file for each supported locale. See the list() function in the localedata module for getting a list of ALL locales. Then write some code to split the locales into (language, country) pairs, as sketched below.
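A rough sketch of that idea (hedged: this groups the locales Babel knows about by language, which is not exactly the same as "countries where the language is official"; newer Babel versions expose localedata.locale_identifiers() instead of list()):
from babel import Locale, localedata

by_language = {}
for ident in localedata.locale_identifiers():
    try:
        loc = Locale.parse(ident)
    except ValueError:
        continue  # skip identifiers Babel cannot parse
    if loc.territory:  # skip language-only locales like 'fr'
        by_language.setdefault(loc.language, set()).add(loc.territory)

print(sorted(by_language.get('fr', set())))  # territories that have an 'fr' locale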
As requested by #NoahSantacruz, I am adding this as a separate answer to make it easier to pick up. At least since 2017, the easiest method by far is:
babel.languages.get_territory_language_info()
See the docs http://babel.pocoo.org/en/latest/api/languages.html#babel.languages.get_territory_language_info
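A brief usage sketch (hedged: I am assuming the returned dict is keyed by language code and carries an 'official_status' field, as described in the docs; 'BY' is just an example territory):
from babel import languages

# languages spoken in Belarus ('BY'), keyed by language code
info = languages.get_territory_language_info('BY')
official = [lang for lang, data in info.items() if data.get('official_status')]
print(official)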
Check out Ethnologue
Be careful though...
India has a lot of official languages.
pycountry (seriously). You can get it from the Package Index.
Usually when we search, we have a list of stories, we provide a search string, and we expect back a list of results where the given search string matches a story.
What I am looking to do is the opposite: given a list of search strings and one story, find out which search strings match that story.
Now, this could be done with re, but the catch is that I want to use complex search queries as supported by Solr. Full details of the query syntax are here. Note: I won't use boost.
Basically, I want some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
    """
    returns result of searching contents for searchstring (True or False)
    """
    ???????
    ???????

story = "big chunk of story 200 to 1000 words long"

searchstrings = ['sajal', 'sajal AND "is a jerk"', 'sajal kayan',
                 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))',
                 'bangkok']

matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr)]
Edit: Additionally, I would be interested to know whether any module exists to convert a Lucene query like the one below into a regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, I realized that what I am looking to do is a Boolean search.
Found code that makes regexes Boolean-aware: http://code.activestate.com/recipes/252526/
The issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and similar fields, this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy (see the sketch below).
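Here is a minimal sketch of my own of such a recursive matcher (not a real Solr/Lucene parser: it only handles AND, OR, quoted phrases, and parentheses, evaluated left to right with no operator precedence):
import re

def doesitmatch(contents, searchstring):
    # tokenize into parens, AND/OR, quoted phrases, and bare terms
    tokens = re.findall(r'\(|\)|AND\b|OR\b|"[^"]+"|[^\s()"]+', searchstring)
    haystack = contents.lower()
    pos = 0

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == '(':
            val = parse_expr()
            pos += 1  # skip the closing ')'
            return val
        return tok.strip('"').lower() in haystack

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ('AND', 'OR'):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == 'AND' else (result or rhs)
        return result

    return parse_expr()

print(doesitmatch("sajal is a webmaster in bangkok", 'sajal AND (kayan OR bangkok)'))  # True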
Some time ago I looked for a Python implementation of Lucene and came across Whoosh, which is a pure-Python text search engine. Maybe it will satisfy your needs.
You can also try PyLucene, but I didn't investigate that one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
    result = set()
    for s in search_strings:
        result_story_ids = query_index(s)  # query_index returns an id iterable
        if story_id_to_match in result_story_ids:
            result.add(s)
    return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though; you'd probably be better off writing something like this in Java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html
I am writing a fairly simple Django application where users can enter string queries. The application will then search through the database for this string.
Entry.objects.filter(headline__contains=query)
This query is pretty straightforward, but it is not really helpful to someone who isn't 100% sure what they are looking for, so I expanded the search.
from django.utils import stopwords
results = Entry.objects.filter(headline__contains=query)
if not results:
    query = strip_stopwords(query)
    for q in query.split(' '):
        results |= Entry.objects.filter(headline__contains=q)
I would like to add some additional functionality to this: searching for misspelled words, plurals, common homophones (words that sound the same but are spelled differently), etc. I was just wondering whether any of these things were built into Django's query language. It isn't important enough for me to write a huge algorithm for; I am really just looking for something built in.
Thanks in advance for all the answers.
You could try using Python's difflib module.
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']
The problem is that, to use difflib, one must build a list of words from the database, and that can be expensive. Maybe cache the list of words and only rebuild it once in a while.
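A rough sketch of that caching idea (hedged: Entry is the model from the question, the import path, cache key, and one-hour timeout are assumptions):
from difflib import get_close_matches
from django.core.cache import cache
from myapp.models import Entry  # import path is an assumption

def similar_headline_words(term):
    # build the word list from all headlines at most once per hour
    words = cache.get('headline_words')
    if words is None:
        words = set()
        for headline in Entry.objects.values_list('headline', flat=True):
            words.update(headline.lower().split())
        words = list(words)
        cache.set('headline_words', words, 3600)
    return get_close_matches(term.lower(), words)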
Some database systems support a search method that does what you want, such as PostgreSQL's fuzzystrmatch module. If that is your case, you could try calling it.
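For example, a hedged sketch using a raw query against PostgreSQL's levenshtein() function (the table name and the 2-edit threshold are assumptions, and the fuzzystrmatch extension must be installed on the database):
# requires: CREATE EXTENSION fuzzystrmatch; on the PostgreSQL database
results = Entry.objects.raw(
    "SELECT * FROM app_entry WHERE levenshtein(headline, %s) <= 2",
    [query],
)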
Edit:
For your new "requirement", you are out of luck: there is nothing like that built into Django's query language.
Django's ORM doesn't have this behavior out of the box, but there are several projects that integrate Django with search services, such as:
sphinx (django-sphinx)
Solr, a search server built on Lucene (djangosearch)
lucene (django-search-lucene)
I can't speak to how well options #2 and #3 work, but I've used django-sphinx quite a lot and am very happy with the results.
cal_name = request.data['column']['name']

words = []
for col in Column.objects.all():
    if cal_name != col.name:
        words.append(col.name)

words = difflib.get_close_matches(cal_name, words)
if len(words) > 0 and is_sure != "true":
    return Response({
        'potential typo': 'Did you mean ' + str(words) + '?',
        "note": "If you think you do not have a typo send {'sure' : 'true'} with the data."})