Whoosh 2.7.4 — Combining Results Error
Thanks in advance for taking the time to answer this. I'm relatively new to both Python (3.6) and Whoosh (2.7.4), so forgive me if I'm missing something obvious.
I'm trying to follow the instructions in the Whoosh Documentation here on How to Search > Combining Results. However, I'm really lost in this section:
# Get the terms searched for
termset = set()
userquery.existing_terms(termset)
As I run my code, it produces this error:
'set' object has no attribute 'schema'
What went wrong?
I also looked at the Whoosh API docs on this, but I just got more confused about the role of ixreader. (Or is it index.Index.reader()?) Shrugs
A Peek at My Code
Schema
schema = Schema(uid=ID(unique=True, stored=True),  # unique ID
                indice=ID(stored=True, sortable=True),
                title=TEXT,
                author=TEXT,
                body=TEXT(analyzer=LanguageAnalyzer(lang)),
                hashtag=KEYWORD(lowercase=True, commas=True,
                                scorable=True)
                )
The relevant fieldnames are 'hashtag' and 'body'. Hashtags are user-selected keywords for each document, and body is the text in the document. Pretty self-explanatory, no?
Search Function
Much of this is lifted directly from the Whoosh docs. Note that dic is just a dictionary containing the query string. Also, the error occurs during userquery.existing_terms(termset), so if the rest of it is bunk, my apologies; I haven't gotten that far.
try:
    ix = index.open_dir(self.w_path, indexname=lang)
    qp = QueryParser('body', schema=ix.schema)
    userquery = qp.parse(dic['string'])

    termset = set()
    userquery.existing_terms(termset)

    bbq = Or([Term('hashtag', text) for fieldname, text
              in termset if fieldname == 'body'])

    s = ix.searcher()
    results = s.search(bbq, limit=5)
    allresults = s.search(userquery, limit=10)
    results.upgrade_and_extend(allresults)

    for r in results:
        print(r)
except Exception as e:
    print('failed to search')
    print(e)
    return False
finally:
    s.close()
Goal of My Code
I am taking pages from different files (pdf, epub, etc.) and storing each page's text as a separate 'document' in a Whoosh index (i.e. in the field 'body'). Each 'document' is also labeled with a unique ID (uid) that lets me take the search Results and determine which pdf file they come from and which pages contain the search hit (e.g. the document from page 2 of "1.pdf" has the uid 1.2). In other words, I want to give the user a list of page numbers that contain the search term, and perhaps the pages with the most hits. For each file, the only document that has hashtags (or keywords) is the one whose uid ends in zero (i.e. page zero, e.g. uid 1.0 for "1.pdf"). Page zero may or may not have a 'body' too (e.g. the publish date, author names, summary, etc.). I did this to prevent a document with more pages from being ranked dramatically higher than one with considerably fewer pages simply because the keyword is repeated in every 'document' (i.e. page).
Ultimately, I just want the code to elevate documents with the hashtag over documents with only search hits in the body text. I thought about just boosting the hashtag field instead, but I'm not sure how the mechanics of that work, and the documentation recommends against it.
Suggestions and corrections would be greatly appreciated. Thank you again!
The code from your link doesn't look right to me; it gives me the same error too. Try rearranging your code as follows:
try:
    ix = index.open_dir(self.w_path, indexname=lang)
    qp = QueryParser('body', schema=ix.schema)
    userquery = qp.parse(dic['string'])

    s = ix.searcher()
    allresults = s.search(userquery, limit=10)

    termset = userquery.existing_terms(s.reader())
    bbq = Or([Term('hashtag', text) for fieldname, text
              in termset if fieldname == 'body'])
    results = s.search(bbq, limit=5)
    results.upgrade_and_extend(allresults)

    for r in results:
        print(r)
except Exception as e:
    print('failed to search')
    print(e)
    return False
finally:
    s.close()
existing_terms requires a reader, so I create the searcher first and pass its reader to it.
As for boosting a field, the mechanics are quite simple:
schema = Schema(title=TEXT(field_boost=2.0), body=TEXT)
Add a sufficiently high boost to bring hashtag documents to the top, and be sure to apply a single query to both the body and hashtag fields.
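For illustration only, a rough sketch of what that could look like with a trimmed-down version of your schema (the 3.0 boost value and the use of MultifieldParser are my suggestions, untested against your index):
from whoosh.fields import Schema, ID, TEXT, KEYWORD
from whoosh.qparser import MultifieldParser

# Same idea as the schema in the question, but with a boost on the
# hashtag field (3.0 is only an illustrative value)
schema = Schema(uid=ID(unique=True, stored=True),
                body=TEXT,
                hashtag=KEYWORD(lowercase=True, commas=True,
                                scorable=True, field_boost=3.0))

# At search time, parse the user's string against both fields at once
qp = MultifieldParser(['body', 'hashtag'], schema=ix.schema)
userquery = qp.parse(dic['string'])
with ix.searcher() as s:
    for r in s.search(userquery, limit=10):
        print(r)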
Deciding between boosting and combining depends on whether you want all matching hashtag documents to always appear at the very top, before any other matches show. If so, combine. If instead you prefer to strike a balance in relevance, albeit with a stronger bias for hashtags, boost.
Related
I need to get tweets that contain a specific subject. I want to see what customers are saying about the company 'this_company', but I don't want the tweets from 'this_company' itself. Therefore, I want to exclude screen_name = 'this_company'.
I'm using:
posts = api.search(q='this_company', lan='en', tweet_mode='extended', since='2020-07-10')
I tried to put screen_name != 'this_company', but it doesn't work (I don't think I can pass an argument with !=).
Does someone know how I can do that?
I believe you can use operators directly in the query, as per the search API. (Some examples here.)
So you could search with q = "this_company -from:this_company"
(Untested code- some quoting might be necessary.)
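For example, keeping the rest of your arguments as they were, the call might look roughly like this (again, a sketch only):
# note: tweepy's language parameter is spelled lang
posts = api.search(q='this_company -from:this_company', lang='en',
                   tweet_mode='extended', since='2020-07-10')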
Hello, I am trying to scrape the tweets of a certain user using Tweepy.
Here is my code:
tweets = []
username = 'example'
count = 140  # nb of tweets

try:
    # Pulling individual tweets from query
    for tweet in api.user_timeline(id=username, count=count, include_rts=False):
        # Adding to list that contains all tweets
        tweets.append((tweet.text))
except BaseException as e:
    print('failed on_status,', str(e))
    time.sleep(3)
The problem I am having is the tweets are coming back unfinished with "..." at the end.
I think I've looked at all the other similar problems on Stack Overflow and elsewhere, but nothing works. Most do not concern me because I am NOT dealing with retweets.
I have tried putting tweet_mode = 'extended' and/or tweet.full_text or tweet._json['extended_tweet']['full_text'] in different combinations.
I don't get an error message but nothing works, just an empty list in return.
And it looks like the documentation is out of date, because it says nothing about either the 'tweet_mode' or the 'include_rts' parameter:
Has anyone managed to get the full text of each tweet?? I'm really stuck on this seemingly simple problem and am losing my hair so I would appreciate any advice :D
Thanks in advance!!!
TL;DR: You're most likely running into a Rate Limiting issue. And use the full_text attribute.
Long version:
First,
The problem I am having is the tweets are coming back unfinished with "..." at the end.
From the Tweepy documentation on Extended Tweets, this is expected:
Compatibility mode
... It will also be discernible that the text attribute of the Status object is truncated as it will be suffixed with an ellipsis character, a space, and a shortened self-permalink URL to the Tweet.
With regard to:
And it looks like the documentation is out of date, because it says nothing about either the 'tweet_mode' or the 'include_rts' parameter:
They haven't explicitly added it to the documentation of each method; however, they do specify that tweet_mode is accepted as a parameter:
Standard API methods
Any tweepy.API method that returns a Status object accepts a new tweet_mode parameter. Valid values for this parameter are compat and extended, which give compatibility mode and extended mode, respectively. The default mode (if no parameter is provided) is compatibility mode.
So without tweet_mode added to the call, you do get the tweets with partial text? And with it, all you get is an empty list? If you remove it and immediately retry, verify that you still get an empty list, i.e. once you get an empty-list result, check whether you keep getting an empty list even when you change the params back to the ones that worked.
Based on bug #1329 - API.user_timeline sometimes returns an empty list - it appears to be a Rate Limiting issue:
Harmon758 commented on Feb 13
This API limitation would manifest itself as exactly the issue you're describing.
Even if it were working, the text is in the full_text attribute, not the usual text. So the line
tweets.append((tweet.text))
should be
tweets.append(tweet.full_text)
(and you can skip the extra enclosing ())
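Putting both changes together, the loop from your question would look something like this:
for tweet in api.user_timeline(id=username, count=count,
                               include_rts=False, tweet_mode='extended'):
    tweets.append(tweet.full_text)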
Btw, if you're not interested in retweets, see this example for the correct way to handle them:
Given an existing tweepy.API object and id for a Tweet, the following can be used to print the full text of the Tweet, or if it’s a Retweet, the full text of the Retweeted Tweet:
status = api.get_status(id, tweet_mode="extended")
try:
    print(status.retweeted_status.full_text)
except AttributeError:  # Not a Retweet
    print(status.full_text)
If status is a Retweet, status.full_text could be truncated.
As per the Twitter API v2:
tweet_mode does not work at all. You need to add expansions=referenced_tweets.id. Then, in the response, look for includes; you can find all the truncated tweets as full tweets there. You will still see the truncated tweets in the response, but do not worry about it.
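As a rough sketch against the v2 recent search endpoint using the requests library (the bearer token, query and field choices below are placeholders, not taken from the question):
import requests

BEARER_TOKEN = '...'  # placeholder

resp = requests.get(
    'https://api.twitter.com/2/tweets/search/recent',
    headers={'Authorization': 'Bearer ' + BEARER_TOKEN},
    params={
        'query': 'from:example',               # placeholder query
        'expansions': 'referenced_tweets.id',  # pull in the referenced tweets
    },
)
payload = resp.json()

# Full versions of truncated retweets/quotes appear under includes.tweets
full_by_id = {t['id']: t['text']
              for t in payload.get('includes', {}).get('tweets', [])}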
I want to extract disease words from medical data to build a disease word dictionary (think notes written by doctors, test results). I'm using Python. I tried the following approaches:
Used the Google API to check whether a word is a disease or not based on the results. It didn't go well because it was extracting other medical words too, even after I tried modifying the search, and I had to buy Google CSE, which I feel is costly because I have huge amounts of data. It's too much code to include in the post.
Used Weka to predict the words, but the data I have is plain text that doesn't follow any rules and isn't in ARFF or CSV format.
Tried NER for extracting disease words. But all the models I have seen need a predefined dictionary to search, and perform tf-idf on the input data. I don't have such a dictionary.
All the models I have seen suggest tokenizing and then POS-tagging the data, which I did, but I still couldn't find a way to extract only the disease words.
I even tried extracting only the nouns, which didn't work well because other medical terms were also tagged as nouns.
My data looks like the following, and the format isn't consistent throughout the document:
After conducting clinical reviews the patient was suffering with
diabetes,htn which was revealed when a complete blood picture of the
patient's blood was done. He was advised to take PRINIVIL TABS 20 MG
(LISINOPRIL) 1.
Believe me, I googled a lot and couldn't come up with a good solution. Please suggest a way for me to move forward.
The following is one of the approaches I tried, and it extracted the other medical terms too. Sorry, the code looks a bit clumsy, and I am posting only the main function, as the whole code would be very lengthy. Look at the search_word variable; the main logic lies there:
def search(self, wordd):  # implemented google custom search engine api
    #responseData = 'None'
    global flag
    global page
    search_word = "\"is+%s+an+organ?\"" % (wordd)
    search_word = str(search_word)
    if flag == 1:
        search_word = "\"%s+is+a+disease\"" % (wordd)
    try:  # searching google for the word
        url = 'https://www.googleapis.com/customsearch/v1?key=AIzaSyAUGKCa2oHSYeZynSMD6zElBKUrg596G_k&cx=00262342415310682663:xy7prswaherw&num=3&q=' + search_word
        print url
        data = urllib2.urlopen(url)
        response_data = json.load(data)
        results = response_data['queries']['request'][0]['totalResults']
        results_count = int(results)
        print "the results is: ", results_count
        if results_count == 0:
            print "no results found"
            flag = 0
            return 0
        else:
            return 1
    #except IOError:
    #    print "network issues!"
    except ValueError:
        print "Problem while decoding JSON data!"
I'm attempting to search for contacts via the Google Contacts API, using multiple search terms. Searching by a single term works fine and returns contact(s):
query = gdata.contacts.client.ContactsQuery()
query.text_query = '1048'
feed = gd_client.GetContacts(q=query)
for entry in feed.entry:
    # Do stuff
However, I would like to search by multiple terms:
query = gdata.contacts.client.ContactsQuery()
query.text_query = '1048 1049 1050'
feed = gd_client.GetContacts(q=query)
When I do this, no results are returned, and I've found so far that spaces are being replaced by + signs:
https://www.google.com/m8/feeds/contacts/default/full?q=3066+3068+3073+3074
I'm digging through the gdata-client-python code right now to find where it's building the query string, but wanted to toss the question out there as well.
According to the docs, both types of search are supported by the API, and I've seen some similar docs when searching through related APIs (Docs, Calendar, etc):
https://developers.google.com/google-apps/contacts/v3/reference#contacts-query-parameters-reference
Thanks!
Looks like I was mistaken in my understanding of the gdata query string functionality.
https://developers.google.com/gdata/docs/2.0/reference?hl=en#Queries
'The service returns all entries that match all of the search terms (like using AND between terms).'
Helps to read the docs and understand them!
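If OR-like behaviour across terms is still needed, one workaround (a sketch only, untested) is to run one query per term and merge the feeds, de-duplicating by entry id:
merged = {}
for term in ['1048', '1049', '1050']:
    query = gdata.contacts.client.ContactsQuery()
    query.text_query = term
    feed = gd_client.GetContacts(q=query)
    for entry in feed.entry:
        merged[entry.id.text] = entry  # keep one copy per contact

contacts = list(merged.values())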
I would like to implement a search function in a Django blogging application. The status quo is that I have a list of strings supplied by the user, and the queryset is narrowed down by each string to include only those objects that match it.
See:
if request.method == "POST":
    form = SearchForm(request.POST)
    if form.is_valid():
        posts = Post.objects.all()
        for string in form.cleaned_data['query'].split():
            posts = posts.filter(
                Q(title__icontains=string) |
                Q(text__icontains=string) |
                Q(tags__name__exact=string)
            )
        return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want to combine each searched word with a logical AND but with a logical OR? How would I do that? Is there a way to do it with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full-text search like this, or would you recommend using a search engine like Solr, Whoosh or Xapian? What are their benefits?
I suggest you adopt a search engine.
We've used Haystack, a modular search application for Django that supports many search engines (Solr, Xapian, Whoosh, etc.).
Advantages:
Faster: performs search queries even without querying the database.
Highlights searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc.
Disadvantages:
Search Indexes can grow in size pretty fast
One of the best search engines (Solr) runs as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
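As a rough illustration of what the setup involves, a Haystack search_indexes.py for a blog Post model might look along these lines (the app and field names here are assumptions, not taken from the question):
from haystack import indexes
from myblog.models import Post  # assumed app/model names

class PostIndex(indexes.SearchIndex, indexes.Indexable):
    # The main document field; by convention it is rendered from a template
    # under templates/search/indexes/<app>/post_text.txt
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    date = indexes.DateTimeField(model_attr='date')

    def get_model(self):
        return Post

    def index_queryset(self, using=None):
        # everything that should be searchable
        return self.get_model().objects.all()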
Actually, the query you have posted does use OR rather than AND - you're using | to separate the Q objects. AND would be &.
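To actually OR across all the searched words (rather than chaining filter() calls, which ANDs them), one option is to build a single Q object. A minimal sketch using the fields from your query (it assumes the query contains at least one word):
import operator
from functools import reduce
from django.db.models import Q

words = form.cleaned_data['query'].split()
q = reduce(operator.or_, (
    Q(title__icontains=w) | Q(text__icontains=w) | Q(tags__name__exact=w)
    for w in words
))
posts = Post.objects.filter(q).distinct()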
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full-text search engine (or whatever you call it), the text (words) is indexed every time you insert new records, so queries will be a lot faster, especially as your database grows.
Solr is very easy to set up and integrate with Django, and Haystack makes it even simpler.
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index eventually.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google about the changed pages and Google will do all the hard work (indexing, parsing the queries, etc.). On top of that, most people are used to using Google for search, plus it will keep your site current in global Google searches, too.
I think full-text search at the application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, I think it might be more affordable to put some time into making a custom full-text search rather than installing an application to perform the search for you. An application would add more dependencies, maintenance and extra effort when storing data. By building the search yourself, you can add nice custom features. For example, if the text exactly matches one title, you can direct the user to that page instead of showing the results. Another would be allowing title: or author: prefixes in keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex

class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items with heaviest weight first
        res = []
        while len(self.data) != 0:
            nominated_weight = 0
            for item, weight in self.data.items():
                if weight > nominated_weight:
                    nominated = item
                    nominated_weight = weight
            self.data.pop(nominated)
            res.append(nominated)
            if len(res) == max_len:
                return res
        return res

    def append(self, weight, item):
        if item in self.data:
            self.data[item] += weight
        else:
            self.data[item] = weight

def search(searchtext):
    candidates = WeightedGroup()

    for arg in shlex.split(searchtext):  # shlex understands quotes
        # Search TITLE
        # order by date so we get most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')
        arg_hits = query.count()  # count is cheap

        if arg_hits > 1000:
            continue  # skip keywords which have too many hits

        # Each of these is expensive as it transfers data
        # from the db and builds a Python object,
        for post in query[:50]:  # so we limit it to 50 for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)

    # TODO add searches for other areas
    # Weight might also be adjusted with number of hits within the text
    # or perhaps you can find other metrics to value a post higher,
    # like number of views

    # candidates can contain a lot of stuff now, show most relevant only
    sorted_result = Post.objects.filter(post_id__in=candidates.list(20))