Since the Fall update, GAE now supports partial searching. Per the documentation: "The API supports partial text matching on string fields".
This seems to be a very popular request, per many threads:
Partial matching GAE search API
Does GAE Datastore support 'partial text search'?
So I would assume a search for 'pyt' would now return 'python'
Has anyone gotten this to work? It doesn't work for me. I'm curious whether there's some setting required, like the ~ for stemming.
"The API supports partial text matching on string fields" in https://cloud.google.com/appengine/docs/python/search/ refers to matching by tokens. Specifically, see https://cloud.google.com/appengine/docs/python/search/#Python_Tokenizing_string_fields ...:
The string is split into tokens wherever whitespace or special
characters (punctuation marks, hash sign, etc.) appear. The index will
include an entry for each token. This enables you to search for
keywords and phrases comprising only part of a field's value.
Therefore your assumption:
So I would assume a search for 'pyt' would now return 'python'
is ill-founded -- "partial search" means parts of a document (a subset of the tokens in a text field thereof), not parts of each token (that would cause a combinatorial explosion; e.g., the single token python would have to be indexed as each and every one of the following entries):
p
py
pyt
pyth
pytho
python
y
yt
yth
ytho
ython
t
th
tho
thon
h
ho
hon
o
on
n
If you want that, it's easy enough to write your own code to produce the explosion (producing a pseudo-document with all of these substrings from a real starting document) -- but, for any non-trivial starting document, you may easily end up either paying for a ridiculous amount of resources or hitting a hard ceiling of absolute maximum quotas.
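If you do go down that road, here is a minimal sketch of the explosion (explode_token is an illustrative name, not an App Engine API):

def explode_token(token):
    # Every contiguous substring of the token: n*(n+1)/2 entries for a
    # token of length n -- the quadratic growth that drives the costs
    # mentioned above.
    return {token[i:j]
            for i in range(len(token))
            for j in range(i + 1, len(token) + 1)}

print(len(explode_token("python")))  # 21, the entries listed above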
Hint: if you do a web search for "pyt", do you find docs containing "python"? Try -- the former gives 10 million hits (Peninsula Youth Theater, Michael Jackson's P.Y.T. (Pretty Young Thing), etc etc), the latter, 180 million hits (the language, the snake, the comedy group:-).
Related
Is it possible to use whoosh as a matcher without building an index?
My situation is that I have subscriptions pre-defined with strings, and documents coming through in a stream. I check each document matches the subscriptions and send them if so. I don't need to store the documents, or recall them later. Once they've been sent to the subscriptions, they can be discarded.
Currently just using simple matching, but as consumers ask for searches based on fields, and/or logic, etc, I'm wondering if it's possible to use a whoosh matcher and allow whoosh query syntax for this.
I could build an index for each document, query it, and then throw it away, but that seems very wasteful. Is it possible to directly construct a Matcher? I couldn't find any docs or questions online indicating a way to do this, and my attempts haven't worked.
Alternatively, is this just the wrong library for this task, and is there something better suited?
The short answer is no.
Search indices and matchers work quite differently. For example, when searching for the phrase "hello world", a matcher would simply check that the document text contains the substring "hello world". A search index cannot do it that way: it would have to check every document in full, which would be very slow.
Instead, as documents are added, every word in them is added to the index under that word. So the index entry for "hello" will say that document 1 matches at position 0, and the entry for "world" will say that document 1 matches at position 6. A search for "hello world" then finds all document IDs in the "hello" index, then all in the "world" index, and checks whether any document has a position for "world" that is 6 characters after its position for "hello".
So it's a completely orthogonal way of doing things in whoosh vs a matcher.
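To make that concrete, here is a toy positional inverted index (a sketch only, using word positions instead of the character offsets described above; this is not whoosh's actual implementation):

from collections import defaultdict

# token -> {doc_id: [word positions]}
index = defaultdict(dict)

def add_document(doc_id, text):
    for pos, word in enumerate(text.lower().split()):
        index[word].setdefault(doc_id, []).append(pos)

def phrase_search(first, second):
    # Documents where `second` appears exactly one word after `first`.
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = index.get(second, {}).get(doc_id, [])
        if any(p + 1 in following for p in positions):
            hits.append(doc_id)
    return hits

add_document(1, "hello world again")
add_document(2, "world hello")
print(phrase_search("hello", "world"))  # [1]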
It is possible to do this with whoosh, using a new index for each document, like so:
from whoosh.fields import Schema, TEXT
from whoosh.filedb.filestore import RamStorage
from whoosh.query import Query

# Assumed here: `Document` is your own class carrying the incoming
# document's fields, and the schema mirrors those fields.
schema = Schema(title=TEXT, description=TEXT, keywords=TEXT)

def matches_subscription(doc: Document, q: Query) -> bool:
    with RamStorage() as store:
        ix = store.create_index(schema)
        writer = ix.writer()
        writer.add_document(
            title=doc.title,
            description=doc.description,
            keywords=doc.keywords
        )
        writer.commit()
        with ix.searcher() as searcher:
            results = searcher.search(q)
            return bool(results)
This takes about 800 milliseconds per check, which is quite slow.
A better solution is to build a parser with pyparsing, and then create your own nested query classes which can do the matching, better fitting your specific search queries. It's quite extensible that way, too. That can bring it down to ~40 microseconds, i.e., 20,000 times faster.
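As a rough sketch of that idea (class names are illustrative; the pyparsing grammar that builds the tree from the query syntax is omitted):

class Term:
    def __init__(self, field, word):
        self.field, self.word = field, word
    def matches(self, doc):
        # doc is a plain dict of field name -> text
        return self.word.lower() in doc.get(self.field, "").lower().split()

class And:
    def __init__(self, *children):
        self.children = children
    def matches(self, doc):
        return all(c.matches(doc) for c in self.children)

class Or:
    def __init__(self, *children):
        self.children = children
    def matches(self, doc):
        return any(c.matches(doc) for c in self.children)

q = And(Term("title", "python"),
        Or(Term("keywords", "search"), Term("keywords", "index")))
print(q.matches({"title": "Python tips", "keywords": "search tricks"}))  # True

Each subscription's query string is parsed once into such a tree, and every incoming document is checked against it directly, with no index built or thrown away.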
I need to grab a url from a text file.
The URL is stored in a string like so: 'URL=http://example.net'.
Is there any way I could grab everything after the = char up until the . in '.net'?
Could I use the re module?
text = """A key feature of effective analytics infrastructure in healthcare is a metadata-driven architecture. In this article, three best practice scenarios are discussed: https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare Automating ETL processes so data analysts have more time to listen and help end users , https://www.google.com/, https://www.facebook.com/, https://twitter.com
code below catches all urls in text and returns urls in list."""
urls = re.findall('(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+', text)
print(urls)
output:
[
'https://www.healthcatalyst.com/clinical-applications-of-machine-learning-in-healthcare',
'https://www.google.com/',
'https://www.facebook.com/',
'https://twitter.com'
]
I don't have much information, but I'll try to help with what I've got. I'm assuming that URL= is part of the string; in that case you can do this:
re.findall(r'URL=(.*?)\.', STRINGNAMEHERE)
Let me go into more detail about (.*?). The dot means any character (except a newline), the star means zero or more occurrences, and the ? after the star makes it non-greedy, so it matches as few characters as possible. The parentheses place it all into a capturing group, and the final \. matches a literal period. All of this together basically means it will find everything in between URL= and the first . that follows.
You don't need RegEx'es (the re module) for such a simple task.
If the string you have is of the form:
'URL=http://example.net'
Then you can solve this using basic Python in numerous ways, one of them being:
file_line = 'URL=http://example.net'
start_position = file_line.find('=') + 1 # this gives you the first position after =
end_position = file_line.find('.')
# this extracts from the start_position up to but not including end_position
url = file_line[start_position:end_position]
Of course, this is only going to extract one URL. Assuming that you're working with a large text where you'd want to extract all URLs, you'll want to put this logic into a function so that you can reuse it and build around it (achieve iteration via while or for loops, and, depending on how you're iterating, keep track of the position of the last extracted URL, and so on).
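For instance, a sketch of such a function (extract_urls is an illustrative name), reusing the same find() logic while tracking where the previous match ended:

def extract_urls(text):
    urls = []
    search_from = 0
    while True:
        start = text.find('URL=', search_from)
        if start == -1:          # no more occurrences
            break
        start += len('URL=')     # first position after 'URL='
        end = text.find('.', start)
        if end == -1:
            break
        urls.append(text[start:end])
        search_from = end + 1    # resume after this match
    return urls

print(extract_urls('URL=http://example.net and URL=http://foo.org'))
# ['http://example', 'http://foo']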
Word of advice
This question has been answered quite a lot on this forum, by very skilled people, in numerous ways, for instance: here, here, here and here, to a level of detail that would amaze you. And these are not all; I just picked the first few that popped up in my search results.
Given that (at the time of posting this question) you're a new contributor to this site, my friendly advice would be to invest some effort into finding such answers. It's a crucial skill, that you can't do without in the world of programming.
Remember, that whatever problem it is that you are encountering, there is a very high chance that somebody on this forum had already encountered it, and received an answer, you just need to find it.
Please try this. It worked for me.
import re
s='url=http://example.net'
print(re.findall(r"=(.*)\.",s)[0])
Given a DBpedia resource, I want to find the entire taxonomy till root.
For example, if I were to say in plain English, for Barack Obama I want to know the entire taxonomy which goes as Barack Obama → Politician → Person → Being.
I have written the following recursive function for the same:
import requests
import json
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
def get_taxonomy(results, entity, hypernym_list):
    '''This recursive function keeps on fetching the hypernyms of the
    DBpedia resource recursively till the highest concept or root is reached'''
    if entity == 'null':
        return hypernym_list
    else:
        query = ''' SELECT ?hypernyms WHERE {<''' + entity + '''> <http://purl.org/linguistics/gold/hypernym> ?hypernyms .}
        '''
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            hypernym_list.append(result['hypernyms']['value'])
        if len(results["results"]["bindings"]) == 0:
            return get_taxonomy(results, 'null', hypernym_list)
        return get_taxonomy(results, results["results"]["bindings"][0]['hypernyms']['value'], hypernym_list)

def get_taxonomy_of_resource(dbpedia_resource):
    list_for_hypernyms = []
    results = {}
    results["results"] = {}
    results["results"]["bindings"] = [1, 2, 3]
    taxonomy_list = get_taxonomy(results, dbpedia_resource, list_for_hypernyms)
    return taxonomy_list
The code works for the following input:
get_taxonomy_of_resource('http://dbpedia.org/resource/Barack_Obama')
Output:
['http://dbpedia.org/resource/Politician',
'http://dbpedia.org/resource/Person', 'http://dbpedia.org/resource/Being']
Problem:
But for the following input it only gives the hypernym one level up and then stops:
get_taxonomy_of_resource('http://dbpedia.org/resource/Steve_Jobs')
Output:
['http://dbpedia.org/resource/Entrepreneur']
Research:
On doing some research on their site dbpedia.org/page/<term>, I realized that the reason it stopped at Entrepreneur is that when I click on this resource on their site, it takes me to the resource 'Entrepreneurship', which states its hypernym as 'Process'. So now my problem comes down to the question:
How do I know that Entrepreneur redirects to Entrepreneurship, even though both are valid DBpedia entities? My recursive function fails because of this: in the next iteration it attempts to find the hypernym for Entrepreneur rather than Entrepreneurship.
Any help is duly appreciated
I have faced this same problem before while writing a program to generate taxonomies, and my solution was to additionally use Wiktionary when my main resource failed to provide a hypernym.
The Wiktionary dump can be downloaded and parsed into a Python dictionary.
For example, the wiktionary entry for 'entrepreneur' contains the following:
Noun
entrepreneur (plural entrepreneurs)
A person who organizes and operates a business venture and assumes much of the associated risk.
From this definition, the hypernym ('person') can be extracted.
Naturally, this approach entails writing code to extract the hypernym from a definition (a task which is at times easy and at times hard depending on the wording of the definition).
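As an illustration, a crude first-pass heuristic (hypernym_from_definition is a hypothetical helper; real definitions need far more care than this) is to grab the head noun that follows a leading article in a genus-differentia definition:

import re

def hypernym_from_definition(definition):
    # Crude heuristic: take the first word after a leading "A/An/The".
    match = re.match(r'(?:a|an|the)\s+(\w+)', definition.strip(), re.IGNORECASE)
    return match.group(1).lower() if match else None

print(hypernym_from_definition(
    "A person who organizes and operates a business venture "
    "and assumes much of the associated risk."))  # person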
This approach provides a fallback routine for cases when the main resource (DBpedia in your case) fails to provide a hypernym.
Finally, as stated by AKSW, it is good to have a method to capture incorrect hypernym relations (e.g. Entrepreneur - Process). There is the area of textual entailment in natural language processing, which studies methods for determining whether one statement contradicts (or implies, or ...) another statement.
I am trying to find a particular tag in an XBRL file. I originally tried using the python-xbrl package, but it is not exactly what I want, so I based my code on the one available from the package.
Here's the part of the XBRL file that I am interested in:
<us-gaap:LiabilitiesCurrent contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_24">65285000000</us-gaap:LiabilitiesCurrent>
<us-gaap:Liabilities contextRef="eol_PE2035----1510-Q0008_STD_0_20150627_0" unitRef="iso4217_USD" decimals="-6" id="id_5025426_6FEF05CB-B19C-4D84-AAF1-79B431731049_1_28">147474000000</us-gaap:Liabilities>
Here is the code (the python-xbrl package is based on beautifulsoup4 and several other packages):
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities)",
                                            re.IGNORECASE | re.MULTILINE))
I get the value for us-gaap:LiabilitiesCurrent, but I want the value for us-gaap:Liabilities.
Right now, as soon as it finds a match, it stores it. But in many cases it's the wrong match due to the tag format in XBRL. I believe I need to change the re.compile() part to make it work correctly.
I'd be very wary about using this approach to parsing XBRL (or indeed, any XML with namespaces in it). "us-gaap:Liabilities" is a QName, consisting of a prefix ("us-gaap") and a local name ("Liabilities"). The prefix is just a shorthand for a full namespace URI such as "http://fasb.org/us-gaap/2015-01-31", which is defined by a namespace declaration, usually at the top of the document. If you look at the top of the document you'll see something like:
xmlns:us-gaap="http://fasb.org/us-gaap/2015-01-31"
This means that within the scope of this document, "us-gaap" is taken to mean that full namespace URI.
XML creators are free to use whatever prefixes they want, so there is no guarantee that the element will actually be called "us-gaap:Liabilities" across all documents that you encounter.
beautifulsoup4 has very limited support for namespaces, so I wouldn't recommend it as a starting point for building an XBRL processor. It may be worth taking a look at the Arelle project, which is a full XBRL processor, and will make it easier to do other tasks such as finding the labels and other information associated with facts in the taxonomy.
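For instance, a sketch of a namespace-aware lookup with lxml (the namespace URI is the one from the declaration above and will differ between taxonomy releases; filing.xml is a placeholder file name):

from lxml import etree

NSMAP = {'us-gaap': 'http://fasb.org/us-gaap/2015-01-31'}

tree = etree.parse('filing.xml')
# Matching on namespace + local name finds us-gaap:Liabilities only,
# regardless of what prefix the document happens to use.
for elem in tree.findall('.//us-gaap:Liabilities', namespaces=NSMAP):
    print(elem.get('contextRef'), elem.text)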
Try it with a $ (dollar sign) at the end of the pattern; the anchor prevents the regex from matching tag names that merely start with "Liabilities":
liabilities = xbrl.find_all(name=re.compile("(us-gaap:Liabilities$)",
                                            re.IGNORECASE | re.MULTILINE))
I'm looking for a way to make a function in python where you pass in a string and it returns whether it's spelled correctly. I don't want to check against a dictionary. Instead, I want it to check Google's spelling suggestions. That way, celebrity names and other various proper nouns will count as being spelled correctly.
Here's where I'm at so far. It works most of the time, but it messes up with some celebrity names. For example, things like "cee lo green" or "posner" get marked as incorrect.
import httplib
import xml.dom.minidom
data = """
<spellrequest textalreadyclipped="0" ignoredups="0" ignoredigits="1" ignoreallcaps="1">
<text> %s </text>
</spellrequest>
"""
def spellCheck(word_to_spell):
    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST", "/tbproxy/spell?lang=en", data % word_to_spell)
    response = con.getresponse()
    dom = xml.dom.minidom.parseString(response.read())
    dom_data = dom.getElementsByTagName('spellresult')[0]
    if dom_data.childNodes:
        for child_node in dom_data.childNodes:
            result = child_node.firstChild.data.split()
            for word in result:
                if word_to_spell.upper() == word.upper():
                    return True
        return False
    else:
        return True
Peter Norvig explains how to implement a spell checker in Python.
Rather than sticking to Google alone, try out the other big players.
If you really want to stick with search engines (which meter your requests), Yahoo and Bing provide some excellent features. Yahoo directly provides spell-checking services via YQL tables (free: 5000 requests/day, non-commercial).
There are also a good number of Python APIs capable of much of the same magic, including on the proper nouns you mentioned (results may occasionally surprise you; after all, it is ultimately based on probability).
And if you would rather avoid web services altogether, there is a good list of entirely free options (see the quick PyEnchant taste after the list):
GNU Aspell (it even has Python bindings)
PyEnchant
Whoosh (it does a lot more than spell checking, but I think it has an edge here)
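As a quick taste, a minimal PyEnchant sketch (assuming the en_US dictionary is installed on your system):

import enchant

d = enchant.Dict("en_US")
print(d.check("python"))   # True: spelled correctly
print(d.check("pyton"))    # False
print(d.suggest("pyton"))  # suggestions, e.g. including 'python'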
I hope these give you a clear idea of how things work.
In truth, spell checking involves very complex mechanisms from the areas of machine learning, AI, NLP, and more, which is why companies like Google and Yahoo don't offer their APIs entirely for free.