I'm looking for a way to make a function in python where you pass in a string and it returns whether it's spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other various proper nouns will count as being spelled correctly.
Here's where I'm at so far. It works most of the time, but it messes up with some celebrity names. For example, things like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom
data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""
def spellCheck(word_to_spell):
con = httplib.HTTPSConnection("www.google.com")
con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
response = con.getresponse()
dom = xml.dom.minidom.parseString(response.read())
dom_data = dom.getElementsByTagName('spellresult')[0]
if dom_data.childNodes:
for child_node in dom_data.childNodes:
result = child_node.firstChild.data.split()
for word in result:
if word_to_spell.upper() == word.upper():
return True;
return False;
else:
return True;
Peter Norvig tells you how implement spell checker in Python.
Rather than sticking to Mr. Google, try out other big fellows.
If you really want to stick with search engines which count page requests, Yahoo and Bing are providing some excellent features. Yahoo is directly providing spell checking services using YQL tables (Free: 5000 request/day and non-commercial).
You have good number of Python API's which are capable to do a lot similar magic including on nouns that you mentioned (sometimes may turn around - after all its somewhere based upon probability)
So, in the second case, you got a good list (totally free)
GNU - Aspell (Even got python bindings)
PyEnchant
Whoosh (It does a lot more than spell checking but I think it has some edge on it.)
I hope they should give you a clear idea of how things work.
Actually spell checking involves very complex mechanisms in the areas of Machine learning, AI, NLP.. etc a lot more. So, companies like Google/ Yahoo don't really offer their API entirely free.
Related
Given a DBpedia resource, I want to find the entire taxonomy till root.
For example, if I were to say in plain English, for Barack Obama I want to know the entire taxonomy which goes as Barack Obama → Politician → Person → Being.
I have written the following recursive function for the same:
import requests
import json
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
def get_taxonomy(results,entity,hypernym_list):
'''This recursive function keeps on fetching the hypernyms of the
DBpedia resource recursively till the highest concept or root is reached'''
if entity == 'null':
return hypernym_list
else :
query = ''' SELECT ?hypernyms WHERE {<'''+entity+'''> <http://purl.org/linguistics/gold/hypernym> ?hypernyms .}
'''
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
hypernym_list.append(result['hypernyms']['value'])
if len(results["results"]["bindings"]) == 0:
return get_taxonomy(results,'null',hypernym_list)
return get_taxonomy(results,results["results"]["bindings"][0]['hypernyms']['value'],hypernym_list)
def get_taxonomy_of_resource(dbpedia_resource):
list_for_hypernyms=[]
results = {}
results["results"]={}
results["results"]["bindings"]=[1,2,3]
taxonomy_list = get_taxonomy(results,dbpedia_resource,list_for_hypernyms)
return taxonomy_list
The code works for the following input:
get_taxonomy_of_resource('http://dbpedia.org/resource/Barack_Obama')
Output:
['http://dbpedia.org/resource/Politician',
'http://dbpedia.org/resource/Person', 'http://dbpedia.org/resource/Being']
Problem :
But for following output it only gives hypernym till one level above and stops:
get_taxonomy_of_resource('http://dbpedia.org/resource/Steve_Jobs')
Output:
['http://dbpedia.org/resource/Entrepreneur']
Research:
On doing some research on their site dbpedia.org/page/<term> I realized that the reason it stopped at Entrepreneur is that when I click on this resource on their site, it takes me to resource 'Entrepreneurship' and state its hypernym as 'Process'. So now my problem has been directed to the question:
How do I know that Entrepreneur is directing to Entrepreneurship even though both are valid DBpedia entities? My recursive function fails due to this as in next iteration it attempts to find hypernym for Entrepreneur rather than Entrepreneurship.
Any help is duly appreciated
I have faced this same problem before while writing a program to generate taxonomies and my solution was to use in addition wiktionary when my main ressource failed to provide a hypernym.
The wiktionary dump can be downloaded and parsed into a python dictionary.
For example, the wiktionary entry for 'entrepreneur' contains the following:
Noun
entrepreneur (plural entrepreneurs)
A person who organizes and operates a business venture and assumes much of the associated risk.
From this definition, the hypernym ('person') can be extracted.
Naturally, this approach entails writing code to extract the hypernym from a definition (a task which is at times easy and at times hard depending on the wording of the definition).
This approach provides a fallback routine in cases when the main ressource (DBpedia in your case) fails to provide a hypernym.
Finally, as stated by AKSW, it is good to have a method to capture incorrect hypernym relations (e.g. Entrepreneur - Process). There is the area of textual entailment in natural language processing, which studies methods for determining if a statement contradicts (or implies or .. ) another statement.
I'd like to write a script (preferably in python, but other languages is not a problem), that can parse what you type into a google search. Suppose I search 'cats', then I'd like to be able to parse the string cats and, for example, append it to a .txt file on my computer.
So if my searches were 'cats', 'dogs', 'cows' then I could have a .txt file like so,
cats
dogs
cows
Anyone know any APIs that can parse the search bar and return the string inputted? Or some object that I can cast into a string?
EDIT: I don't want to make a chrome extension or anything, but preferably a python (or bash or ruby) script I can run in terminal that can do this.
Thanks
If you have access to the URL, you can look for "&q=" to find the search term. (http://google.com/...&q=cats..., for example).
I can offer 2 popular solution
1) Google have a search-engine API https://developers.google.com/products/#google-search
(It have restriction on 100 requests per day)
cutted code:
def gapi_parser(args):
query = args.text; count = args.max_sites
import config
api_key = config.api_key
cx = config.cx
#Note: This API returns up to the first 100 results only.
#https://developers.google.com/custom-search/v1/using_rest?hl=ru-RU#WorkingResults
results = []; domains = set(); errors = []; start = 1
while True:
req = 'https://www.googleapis.com/customsearch/v1?key={key}&cx={cx}&q={q}&alt=json&start={start}'.format(key=api_key, cx=cx, q=query, start=start)
if start>=100: #google API does not can do more
break
con = urllib2.urlopen(req)
if con.getcode()==200:
data = con.read()
j = json.loads(data)
start = int(j['queries']['nextPage'][0]['startIndex'])
for item in j['items']:
match = re.search('^(https?://)?\w(\w|\.|-)+', item['link'])
if match:
domain = match.group(0)
if domain not in results:
results.append(domain)
domains.update([domain])
else:
errors.append('Can`t recognize domain: %s' % item['link'])
if len(domains) >= args.max_sites:
break
print
for error in errors:
print error
return (results, domains)
2) I wrote a selenuim based script what parse a page in real browser instance, but this solution have a some restrictions, for example captcha if you run searches like a robots.
A few options you might consider, with their advantages and disadvantages:
URL:
advantage: as Chris mentioned, accessing the URL and manually changing it is an option. It should be easy to write a script for this, and I can send you my perl script if you want
disadvantage: I am not sure if you can do it. I made a perl script for that before, but it didn't work because Google states that you can't use its services outside the Google interface. You might face the same problem
Google's search API:
advantage: popular choice. Good documentation. It should be a safe choice
disadvantage: Google's restrictions.
Research other search engines:
advantage: they might not have the same restrictions as Google. You might find some search engines that let you play around more and have more freedom in general.
disadvantage: you're not going to get results that are as good as Google's
I have data coming in to a python server via a socket. Within this data is the string '<port>80</port>' or which ever port is being used.
I wish to extract the port number into a variable. The data coming in is not XML, I just used the tag approach to identifying data for future XML use if needed. I do not wish to use an XML python library, but simply use something like regexp and strings.
What would you recommend is the best way to match and strip this data?
I am currently using this code with no luck:
p = re.compile('<port>\w</port>')
m = p.search(data)
print m
Thank you :)
Regex can't parse XML and shouldn't be used to parse fake XML. You should do one of
Use a serialization method that is nicer to work with to start with, such as JSON or an ini file with the ConfigParser module.
Really use XML and not something that just sort of looks like XML and really parse it with something like lxml.etree.
Just store the number in a file if this is the entirety of your configuration. This solution isn't really easier than just using JSON or something, but it's better than the current one.
Implementing a bad solution now for future needs that you have no way of defining or accurately predicting is always a bad approach. You will be kept busy enough trying to write and maintain software now that there is no good reason to try to satisfy unknown future needs. I have never seen a case where "I'll put this in for later" has led to less headache later on, especially when I put it in by doing something completely wrong. YAGNI!
As to what's wrong with your snippet other than using an entirely wrong approach, angled brackets have a meaning in regex.
Though Mike Graham is correct, using regex for xml is not 'recommended', the following will work:
(I have defined searchType as 'd' for numerals)
searchStr = 'port'
if searchType == 'd':
retPattern = '(<%s>)(\d+)(</%s>)'
else:
retPattern = '(<%s>)(.+?)(</%s>)'
searchPattern = re.compile(retPattern % (searchStr, searchStr))
found = searchPattern.search(searchStr)
retVal = found.group(2)
(note the complete lack of error checking, that is left as an exercise for the user)
Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search strings matches the story.
What I am looking to do, is the opposite. Give a list of search strings, and one story and find out which search strings match to that story.
Now this could be done with re but the case here is i wanna use complex search queries as supported by solr. Full details of the query syntax here. Note: i wont use boost.
Basically i want to get some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
"""
returns result of searching contents for searchstring (True or False)
"""
???????
???????
story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']
matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]
Edit: Additionally would also be interested to know if any module exists to convert lucene query like below into regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, i realized what i am looking to do is a Boolean search.
Found the code that makes regex boolean aware : http://code.activestate.com/recipes/252526/
Issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
Some time ago I looked for a python implementaion of lucene and I came accross of Woosh which is a pure python text-based research engine. Maybe it will statisfy your needs.
You can also try pyLucene, but i did'nt investigate this one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
result = set()
for s in search_strings:
result_story_ids = query_index(s) # query_index returns an id iterable
if story_id_to_match in result_story_ids:
result.add(s)
return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html
I want to process a medium to large number of text snippets using a spelling/grammar checker to get a rough approximation and ranking of their "quality." Speed is not really of concern either, so I think the easiest way is to write a script that passes off the snippets to Microsoft Word (2007) and runs its spelling and grammar checker on them.
Is there a way to do this from a script (specifically, Python)? What is a good resource for learning about controlling Word programmatically?
If not, I suppose I can try something from Open Source Grammar Checker (SO).
Update
In response to Chris' answer, is there at least a way to a) open a file (containing the snippet(s)), b) run a VBA script from inside Word that calls the spelling and grammar checker, and c) return some indication of the "score" of the snippet(s)?
Update 2
I've added an answer which seems to work, but if anyone has other suggestions I'll keep this question open for some time.
It took some digging, but I think I found a useful solution. Following the advice at http://www.nabble.com/Edit-a-Word-document-programmatically-td19974320.html I'm using the win32com module (if the SourceForge link doesn't work, according to this Stack Overflow answer you can use pip to get the module), which allows access to Word's COM objects. The following code demonstrates this nicely:
import win32com.client, os
wdDoNotSaveChanges = 0
path = os.path.abspath('snippet.txt')
snippet = 'Jon Skeet lieks ponies. I can haz reputashunz? '
snippet += 'This is a correct sentence.'
file = open(path, 'w')
file.write(snippet)
file.close()
app = win32com.client.gencache.EnsureDispatch('Word.Application')
doc = app.Documents.Open(path)
print "Grammar: %d" % (doc.GrammaticalErrors.Count,)
print "Spelling: %d" % (doc.SpellingErrors.Count,)
app.Quit(wdDoNotSaveChanges)
which produces
Grammar: 2
Spelling: 3
which match the results when invoking the check manually from Word.