Unable to iteratively call yahoo's term extractor api using python - python

I am trying to loop through some 50-odd files in a directory. Each file has some text for which i am trying to find the keywords using Yahoo Term Extractor. I am able to extract text from each file, but I am not able to iteratively call the API using the text as input. Only the keywords for the first file is displayed.
Here is my code snippet:
in 'comments' list, I have extracted and stored the text from each file.
for c in comments:
print "building query"
dataDict = [ ('appid', appid), ('context', c)]
queryData = urllib.urlencode(dataDict)
request.add_data(queryData)
print "fetching result"
result = OPENER.open(request).read()
print result
time.sleep(1)

Well I don't know anything about the Yahoo Term Extractor, but I'd presume that your call request.add_data(queryData) simply tacks on another data set with each iteration of your loop. And then the call to OPENER.open(request).read() would probably only process the results of the first data set. So either your request object can only hold one query, or your OPENER object's inner workings can only process one query, it's as simple as that.
Actually a third reason comes to mind now that I read the documentation provided at your link, and this is probably the true one:
RATE LIMITS
The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting.
So it would make sense that the API would limit your usage to one query at a time, and not allow you to flood a bunch of queries in a single request.
In any event, I'd assume you could fix your problem in a "naive" way by having many request variables instead of just one, or maybe just creating a new request with every iteration of your loop. If you're not worried about storing your results, and just trying to debug, you could try:
for c in comments:
print "building query"
dataDict = [ ('appid', appid), ('context', c)]
queryData = urllib.urlencode(dataDict)
request = urllib2.Request() # I don't know how to initialize this variable, do it yourself
request.add_data(queryData)
print "fetching result"
result = OPENER.open(request).read()
print result
time.sleep(1)
Again, I don't know about the Yahoo Term Extractor (nor do I really have time to research it) so there may very well be a better, more native way to do this. If you post more details of your code (i.e. what classes are the request and OPENER objects coming from) then I might be able to elaborate on this.

Related

Soundcloud API python issues with linked partitioning

On the Soundcloud API guide (https://developers.soundcloud.com/docs/api/guide#pagination)
the example given for reading more than 100 piece of data is as follows:
# get first 100 tracks
tracks = client.get('/tracks', order='created_at', limit=page_size)
for track in tracks:
print track.title
# start paging through results, 100 at a time
tracks = client.get('/tracks', order='created_at', limit=page_size,
linked_partitioning=1)
for track in tracks:
print track.title
I'm pretty certain this is wrong as I found that 'tracks.collection' needs referencing rather than just 'tracks'. Based on the GitHub python soundcloud API wiki it should look more like this:
tracks = client.get('/tracks', order='created_at',limit=10,linked_partitioning=1)
while tracks.collection != None:
for track in tracks.collection:
print(track.playback_count)
tracks = tracks.GetNextPartition()
Where I have removed the indent from the last line (I think there is an error on the wiki it is within the for loop which makes no sense to me). This works for the first loop. However, this doesn't work for successive pages because the "GetNextPartition()" function is not found. I've tried the last line as:
tracks = tracks.collection.GetNextPartition()
...but no success.
Maybe I'm getting versions mixed up? But I'm trying to run this with Python 3.4 after downloading the version from here: https://github.com/soundcloud/soundcloud-python
Any help much appreciated!
For anyone that cares, I found this solution on the SoundCloud developer forum. It is slightly modified from the original case (searching for tracks) to list my own followers. The trick is to call the client.get function repeatedly, passing the previously returned "users.next_href" as the request that points to the next page of results. Hooray!
pgsize=200
c=1
me = client.get('/me')
#first call to get a page of followers
users = client.get('/users/%d/followers' % me.id, limit=pgsize, order='id',
linked_partitioning=1)
for user in users.collection:
print(c,user.username)
c=c+1
#linked_partitioning means .next_href exists
while users.next_href != None:
#pass the contents of users.next_href that contains 'cursor=' to
#locate next page of results
users = client.get(users.next_href, limit=pgsize, order='id',
linked_partitioning=1)
for user in users.collection:
print(c,user.username)
c=c+1

How to disable query cache?

First of all, sorry for not 100% clearly questions title.
It is easier to explain with few lines of code:
query = {...}
while True:
elastic_response = elastic_client.search(elastic_index, body=query, request_cache=False)
if elastic_response["hits"]["total"]) == 0:
break
else:
for doc in elastic_response["hits"]["hits"]:
print("delete {}".format(doc["_id"]))
elastic_client.delete(index=elastic_index, doc_type=doc["_type"], id=doc["_id"])
I make a search, then delete all the docs and then do the search again to get the next bunch.
BUT the search query gives me the same docs! And this results in 404 exception on delete. It has to be some kind of cache, but i does not found anything, "request_cache" doesn't help.
I can probably refactor this code to use batch delete, but i want to understand what is wrong here
P.S. i'm using the official python client
If using a sleep() after the deletes makes the documents go away, then it's not about cache. It's about the refresh_interval and the near real timeness or Elasticsearch.
So, call _refresh after your code leaves the for loop. Also, don't delete document by document, but create a _bulk request where you delete all your documents in batches, depending on how many they are.

Flask template streaming with Jinja

I have a flask application. On a particular view, I show a table with about 100k rows in total. It's understandably taking a long time for the page to load, and I'm looking for ways to improve it. So far I've determined I query the database and get a result fairly quickly. I think the problem lies in rendering the actual page. I've found this page on streaming and am trying to work with that, but keep running into problems. I've tried the stream_template solution provided there with this code:
#app.route('/thing/matches', methods = ['GET', 'POST'])
#roles_accepted('admin', 'team')
def r_matches():
matches = Match.query.filter(
Match.name == g.name).order_by(Match.name).all()
return Response(stream_template('/retailer/matches.html',
dashboard_title = g.name,
match_show_option = True,
match_form = form,
matches = matches))
def stream_template(template_name, **context):
app.update_template_context(context)
t = app.jinja_env.get_template(template_name)
rv = t.stream(context)
rv.enable_buffering(5)
return rv
The Match query is the one that returns 100k+ items. However, whenever I run this the page just shows up blank with nothing there. I've also tried the solution with streaming the data to a json and loading it via ajax, but nothing seems to be in the json file either! Here's what that solution looks like:
#app.route('/large.json')
def generate_large_json():
def generate():
app.logger.info("Generating JSON")
matches = Product.query.join(Category).filter(
Product.retailer == g.retailer,
Product.match != None).order_by(Product.name)
for match in matches:
yield json.dumps(match)
app.logger.info("Sending file response")
return Response(stream_with_context(generate()))
Another solution I was looking at was for pagination. This solution works well, except I need to be able to sort through the entire dataset by headers, and couldn't find a way to do that without rendering the whole dataset in the table then using JQuery for sorting/pagination.
The file I get by going to /large.json is always empty. Please help or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.
The problem in both cases is almost certainly that you are hanging on building 100K+ Match items and storing them in memory. You will want to stream the results from the DB as well using yield_per. However, only PostGres+psycopg2 support the necessary stream_result argument (here's a way to do it with MySQL):
matches = Match.query.filter(
Match.name == g.name).order_by(Match.name).yield_per(10)
# Stream ten results at a time
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half

Checking if A follows B on twitter using Tweepy/Python

I have a list of a few thousand twitter ids and I would like to check who follows who in this network.
I used Tweepy to get the accounts using something like:
ids = {}
for i in list_of_accounts:
for page in tweepy.Cursor(api.followers_ids, screen_name=i).pages():
ids[i]=page
time.sleep(60)
The values in the dictionary ids form the network I would like to analyze. If I try to get the complete list of followers for each id (to compare to the list of users in the network) I run into two problems.
The first is that I may not have permission to see the user's followers - that's okay and I can skip those - but they stop my program. This is the case with the following code:
connections = {}
for x in user_ids:
l=[]
for page in tweepy.Cursor(api.followers_ids, user_id=x).pages():
l.append(page)
connections[x]=l
The second is that I have no way of telling when my program will need to sleep to avoid the rate-limit. If I put a 60 second wait after every page in this query - my program would take too long to run.
I tried to find a simple 'exists_friendship' command that might get around these issues in a simpler way - but I only find things that became obsolete with the change to API 1.1. I am open to using other packages for Python. Thanks.
if api.exists_friendship(userid_a, userid_b):
print "a follows b"
else:
print "a doesn't follow b, check separately if b follows a"

Most Efficient Way of Iterating Over Large (20,000+ items) List Python

I have a set of Twitter data that I've accessed using the Tweepy Python library. However, I'm quickly realizing I haven't collected all the necessary data. What I'm doing now is extracting Tweet IDs from this uncleaned dataset, storing them in a list, and then iterating over this list to send each Tweet ID as a query to Twitter's API. I'd like to append each returned Twitter status as a JSON object/Python dict to a list. I'd like to then write these out to a flat file or MongoDB (assuming I can learn the latter in a timely fashion). I've been trying with something like the following code:
long_list = [id1, id2, id3, id4 .... id20000]
status_list = []
for i in long_list:
try:
tweet = api.get_status(i)
status_list.append(tweet._payload)
except:
pass
However, the above code seems to be timing out and my Python interpreter becomes unresponsive almost immediately after this is executed. I'm thinking there has to be a more efficient way to do this, but I have no idea what that might be. Any help would be immensely appreciated.
Are you sure it's your list code that get it slow, not the API you called? to check performance, you can try profile it, and there are graphical tools like runsnakerun
One simple way to check if your code is in middle of something is to print something,
like:
print 'getting status from tweet'
tweet = api.get_status(i)
print 'appending to my list'
status_list.append(tweet._payload)
Or you can use the logging module that handles nice things for you.
import logging
FORMAT = '%(asctime)-15s %(clientip)s %(user)-8s %(message)s'
logging.basicConfig(format=FORMAT)
logging.info('getting status from tweet')

Categories

Resources