failing to iterate over list for querying DynamoDB - python

I am trying to query a DynamoDB table by iterating over a list of ids, and it is failing: each query returns empty JSON. If I run the query with a single id, I get data back.
I read the ids from a file into a list.
Below is my code with the loop:
with open('file.txt') as f:
    resid = f.read().splitlines()
for id in resid:
    result = table.query(
        IndexName="partner_resid-index",
        KeyConditionExpression=Key("id").eq(partner_resid[0]),
        FilterExpression=Key("event").eq("active"),
    )
    print(result)
I even tried calling it through a function, but no luck.
Any suggestions on what I am missing here?

The boto3 query function returns only a single page of query results. You must check whether this result has a LastEvaluatedKey and if it does, send another query, with ExclusiveStartKey set to the last LastEvaluatedKey, and continue to do that until you get the last page, without LastEvaluatedKey set.
The thing is, if your FilterExpression filters out a lot of results, you may even get an empty page, and it is possible this is the empty result you're seeing. Note that DynamoDB first reads a page full of data (by default, 1MB of data), and only then applies the FilterExpression to it. It is possible to get back an empty page if none of those results matched the filter, and you still need to continue the loop to the next page.
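A minimal sketch of that loop, written against any object with a boto3-style query method. The helper name query_all_pages is made up for illustration; LastEvaluatedKey and ExclusiveStartKey are the names described above.

```python
def query_all_pages(table, **query_kwargs):
    """Run a DynamoDB query and collect the items from every page."""
    items = []
    kwargs = dict(query_kwargs)  # copy so the caller's arguments are not mutated
    while True:
        page = table.query(**kwargs)
        # A page can be empty when the filter rejects everything that was read.
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break  # no LastEvaluatedKey means this was the last page
        kwargs["ExclusiveStartKey"] = last_key
    return items
```

In a loop over ids, this would be called once per id with the same keyword arguments you would otherwise give to table.query, passing the loop variable in the key condition.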
Alternatively, you can use boto3's paginator mechanism. It is used like:
got_items = []
paginator = dynamodb.meta.client.get_paginator('query')
for page in paginator.paginate(TableName='name', KeyConditionExpression=...):
    got_items += page['Items']

Related

how to get nested data with pandas and request

I'm going crazy trying to get data through an API call using requests and pandas. It looks like it's nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you provided it looks like you are already able to get the data, and I'm assuming that you have it as a dictionary already.
From what I can tell, I don't think you should be using pandas, unless it's some downstream requirement in the task you are doing. But to get the ItemNumber and QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
# the data_list is only one element long, so grab the first element, which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of a popular Python library for parsing JSON. It will confuse future readers and will clash with the library name if you end up having to import json.
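If a real response holds more than one order or more than one item line, the same indexing generalizes to two nested loops. A sketch under assumptions: the payload below is hypothetical and trimmed to the keys used above (header fields omitted), and the response is held in a variable named payload rather than json.

```python
# Hypothetical payload, trimmed to the keys used in the answer above.
payload = {
    "Data": [
        {
            "SoEstimateHeader": {},  # header fields omitted
            "SoEstimateItemLineArr": [
                {"ItemNumber": "BC", "QtyRemainingToShip": 1},
                {"ItemNumber": "XY", "QtyRemainingToShip": 4},
            ],
        },
        {
            "SoEstimateHeader": {},
            "SoEstimateItemLineArr": [
                {"ItemNumber": "BC", "QtyRemainingToShip": 2},
            ],
        },
    ]
}


def item_lines(payload):
    """Return one (ItemNumber, QtyRemainingToShip) tuple per item line."""
    rows = []
    for order in payload["Data"]:
        for line in order["SoEstimateItemLineArr"]:
            rows.append((line["ItemNumber"], line["QtyRemainingToShip"]))
    return rows


rows = item_lines(payload)
for item_number, qty in rows:
    print(item_number, qty)
```

If a DataFrame is needed downstream after all, the list of tuples can be handed to pd.DataFrame(rows, columns=["ItemNumber", "QtyRemainingToShip"]).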

Imgur API - How do I retrieve all favorites without pagination?

According to the Imgur Docs, the "GET Account Favorites" API call takes optional arguments for pagination, implying that all objects are returned without it.
However, when I use the following code snippet (the application has been registered and OAuth has already been performed against my account for testing), I get only the first 30 JSON objects. In the snippet below, I already have an access_token for an authorized user and can retrieve data for that username. But the returned list always contains only the first 30 items.
username = token['username']
bearer_headers = {
    'Authorization': 'Bearer ' + token['access_token']
}
fav_url = 'https://api.imgur.com/3/account/' + username + '/' + 'favorites'
r = requests.get(fav_url, headers=bearer_headers)
r_json = r.json()
favorites=r_json['data']
len(favorites)
print(favorites)
The requests response returns a dictionary with three keys: status (the HTTP status code), success (true or false), and data, of which the value is a list of dictionaries (one per favorited item).
I'm trying to retrieve this without pagination so I can extract specific metadata values into a Pandas dataframe (id, post date, etc).
I originally thought this was a Pandas display problem in Jupyter notebook, but tracked it back to the API only returning the newest 30 list items, despite the docs indicating otherwise. If I place an arbitrary page number at the end (eg, "/favorites/1"), it returns the 30 items appropriate to that page, but there doesn't seem to be an option to get all items or retrieve a count of the total items or number of pages in advance.
What am I missing?
Postscript: It appears that none of the URIs work without pagination, eg, get account images, get gallery submissions, etc. Anything where there is an optional "/{{page}}" parameter, it will default to first page if none is specified. So I guess the larger question is, "does Imgur API even support non-paginated data, and how is that accessed?".
Paginated data is usually used when the possible size of the response can be arbitrarily large. I would be surprised if a major service like Imgur had an API that didn't work this way.
As you have found, the page attribute may be optional, and if you don't provide it, you get the first page as your response.
If you want to get more than the first page, you will need to loop over the page number:
data = []
page = 0
while block := connection.get(page=page):
    data.append(block)
    page += 1
This assumes Python 3.8+ because of the := assignment expression. If you are on an older version you'll need to set block in the loop body, but the same idea applies.
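A more concrete sketch of the same loop, without the := operator so it also runs on older Python 3 versions. It assumes that a page past the end comes back as an empty list, which I have not verified against Imgur. The helper names fetch_all_pages and imgur_favorites_page are made up; the URL pattern and headers are the ones from the question.

```python
def fetch_all_pages(fetch_page, start_page=0, max_pages=1000):
    """Call fetch_page(0), fetch_page(1), ... until a page comes back empty."""
    items = []
    for page in range(start_page, start_page + max_pages):
        block = fetch_page(page)
        if not block:
            break  # assumption: an empty page marks the end of the data
        items.extend(block)
    return items


def imgur_favorites_page(username, access_token, page):
    """Fetch one page of favorites, using the URL pattern from the question."""
    import requests  # imported here so fetch_all_pages works without it

    url = f"https://api.imgur.com/3/account/{username}/favorites/{page}"
    headers = {"Authorization": "Bearer " + access_token}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.json()["data"]
```

It would be used as fetch_all_pages(lambda p: imgur_favorites_page(token['username'], token['access_token'], p)). The max_pages cap is only a safety net against an endpoint that never returns an empty page.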

Cloudsearch Request Exceed 10,000 Limit

When I search a query that has more than 10,000 matches I get the following error:
{u'message': u'Request depth (10100) exceeded, limit=10000', u'__type': u'#SearchException', u'error': {u'rid': u'zpXDxukp4bEFCiGqeQ==', u'message': u'[*Deprecated*: Use the outer message field] Request depth (10100) exceeded, limit=10000'}}
When I search with narrower keywords and queries that have fewer results, everything works fine and no error is returned.
I guess I have to limit the search somehow, but I'm unable to figure out how. My search function looks like this:
def execute_query_string(self, query_string):
    amazon_query = self.search_connection.build_query(q=query_string, start=0, size=100)
    json_search_results = []
    for json_blog in self.search_connection.get_all_hits(amazon_query):
        json_search_results.append(json_blog)
    results = []
    for json_blog in json_search_results:
        results.append(json_blog['fields'])
    return results
And it's being called like this:
results = searcher.execute_query_string(request.GET.get('q', ''))[:100]
As you can see, I've tried to limit the results with the start and size attributes of build_query(). I still get the error though.
I must have misunderstood how to avoid getting more than 10,000 matches in a search result. Can someone tell me how to do it?
All I can find on this topic is Amazon's Limits where it says that you can only request 10,000 results. It does not say how to limit it.
You're calling get_all_hits, which gets ALL results for your query. That is why your size param is being ignored.
From the docs:
get_all_hits(query) Get a generator to iterate over all search results
Transparently handles the results paging from Cloudsearch search
results so even if you have many thousands of results you can iterate
over all results in a reasonably efficient manner.
http://boto.readthedocs.org/en/latest/ref/cloudsearch2.html#boto.cloudsearch2.search.SearchConnection.get_all_hits
You should be calling search instead -- http://boto.readthedocs.org/en/latest/ref/cloudsearch2.html#boto.cloudsearch2.search.SearchConnection.search
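A sketch of the question's function rewritten around search, in Python 3. It assumes that search accepts the same q, start and size arguments as build_query, and that iterating its result yields the same hit dictionaries as get_all_hits; check both against the docs linked above. The connection is passed in as a parameter here so the function stands alone.

```python
def execute_query_string(search_connection, query_string, size=100):
    """Return the 'fields' dict of at most `size` hits from one search request."""
    # search() issues a single request, so start and size are honoured
    # (assumed arguments: q, start, size, as used with build_query above).
    results = search_connection.search(q=query_string, start=0, size=size)
    # Assumption: iterating the result yields hit dicts with a 'fields' key,
    # as the hits from get_all_hits do in the question.
    return [hit["fields"] for hit in results]
```

With this, the [:100] slice at the call site is no longer needed, since at most size hits are requested in the first place.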

Flask template streaming with Jinja

I have a flask application. On a particular view, I show a table with about 100k rows in total. It's understandably taking a long time for the page to load, and I'm looking for ways to improve it. So far I've determined I query the database and get a result fairly quickly. I think the problem lies in rendering the actual page. I've found this page on streaming and am trying to work with that, but keep running into problems. I've tried the stream_template solution provided there with this code:
@app.route('/thing/matches', methods = ['GET', 'POST'])
@roles_accepted('admin', 'team')
def r_matches():
    matches = Match.query.filter(
        Match.name == g.name).order_by(Match.name).all()
    return Response(stream_template('/retailer/matches.html',
        dashboard_title = g.name,
        match_show_option = True,
        match_form = form,
        matches = matches))

def stream_template(template_name, **context):
    app.update_template_context(context)
    t = app.jinja_env.get_template(template_name)
    rv = t.stream(context)
    rv.enable_buffering(5)
    return rv
The Match query is the one that returns 100k+ items. However, whenever I run this the page just shows up blank with nothing there. I've also tried the solution with streaming the data to a json and loading it via ajax, but nothing seems to be in the json file either! Here's what that solution looks like:
@app.route('/large.json')
def generate_large_json():
    def generate():
        app.logger.info("Generating JSON")
        matches = Product.query.join(Category).filter(
            Product.retailer == g.retailer,
            Product.match != None).order_by(Product.name)
        for match in matches:
            yield json.dumps(match)
    app.logger.info("Sending file response")
    return Response(stream_with_context(generate()))
Another solution I was looking at was for pagination. This solution works well, except I need to be able to sort through the entire dataset by headers, and couldn't find a way to do that without rendering the whole dataset in the table then using JQuery for sorting/pagination.
The file I get by going to /large.json is always empty. Please help or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.
The problem in both cases is almost certainly that you are hanging on building 100K+ Match items and storing them in memory. You will want to stream the results from the DB as well, using yield_per. However, only PostgreSQL with psycopg2 supports the necessary stream_results option (here's a way to do it with MySQL):
matches = Match.query.filter(
    Match.name == g.name).order_by(Match.name).yield_per(10)
# Stream ten results at a time
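On the JSON route, two details may also matter. json.dumps cannot serialize arbitrary objects, and chunks yielded back to back with no separators do not add up to a valid JSON document. A minimal sketch of a generator that streams a valid JSON array one element at a time; stream_json_array and the to_dict converter are hypothetical names, not part of Flask or SQLAlchemy.

```python
import json


def stream_json_array(rows, to_dict):
    """Yield the pieces of a JSON array, one element at a time."""
    yield "["
    first = True
    for row in rows:
        if not first:
            yield ","  # separator between elements, never after the last one
        first = False
        # to_dict turns a row into something json.dumps can serialize
        yield json.dumps(to_dict(row))
    yield "]"
```

In the view it would be wrapped the same way as generate() in the question, for example Response(stream_with_context(stream_json_array(matches, to_dict)), mimetype='application/json'), with matches being the yield_per query.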
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
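Server-side pagination does not rule out sorting over the whole dataset by a column header: the database does the sort, and only one page of the sorted result is sent. A sketch of the idea using the standard library's sqlite3 in place of Flask-SQLAlchemy, so it runs on its own. The table name, column names and the fetch_page helper are made up for illustration.

```python
import sqlite3

# Whitelist of sortable columns: column names cannot be bound as parameters.
SORTABLE_COLUMNS = {"name", "price"}


def fetch_page(conn, sort_by="name", descending=False, page=1, per_page=50):
    """Return one page of rows, sorted over the whole table by the database."""
    if sort_by not in SORTABLE_COLUMNS:
        raise ValueError(f"cannot sort by {sort_by!r}")
    direction = "DESC" if descending else "ASC"
    offset = (page - 1) * per_page
    sql = (
        f"SELECT name, price FROM matches "
        f"ORDER BY {sort_by} {direction}, rowid LIMIT ? OFFSET ?"
    )
    return conn.execute(sql, (per_page, offset)).fetchall()


# Hypothetical data: 100 rows in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE matches (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO matches VALUES (?, ?)",
    [(f"item{i:03d}", 100 - i) for i in range(100)],
)
print(fetch_page(conn, sort_by="price", page=1, per_page=3))
```

A click on a table header would then reload the page with a different sort_by value rather than sorting in the browser. With Flask-SQLAlchemy the equivalent would be an order_by on the query followed by its pagination support.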
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half

Unable to iteratively call yahoo's term extractor api using python

I am trying to loop through some 50-odd files in a directory. Each file has some text for which I am trying to find the keywords using Yahoo Term Extractor. I am able to extract text from each file, but I am not able to iteratively call the API using the text as input. Only the keywords for the first file are displayed.
Here is my code snippet:
In the 'comments' list, I have extracted and stored the text from each file.
for c in comments:
    print "building query"
    dataDict = [ ('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Well I don't know anything about the Yahoo Term Extractor, but I'd presume that your call request.add_data(queryData) simply tacks on another data set with each iteration of your loop. And then the call to OPENER.open(request).read() would probably only process the results of the first data set. So either your request object can only hold one query, or your OPENER object's inner workings can only process one query, it's as simple as that.
Actually a third reason comes to mind now that I read the documentation provided at your link, and this is probably the true one:
RATE LIMITS
The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting.
So it would make sense that the API would limit your usage to one query at a time, and not allow you to flood a bunch of queries in a single request.
In any event, I'd assume you could fix your problem in a "naive" way by having many request variables instead of just one, or maybe just creating a new request with every iteration of your loop. If you're not worried about storing your results, and just trying to debug, you could try:
for c in comments:
    print "building query"
    dataDict = [ ('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request = urllib2.Request() # I don't know how to initialize this variable, do it yourself
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Again, I don't know about the Yahoo Term Extractor (nor do I really have time to research it) so there may very well be a better, more native way to do this. If you post more details of your code (i.e. what classes are the request and OPENER objects coming from) then I might be able to elaborate on this.
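For reference, a sketch of the "new request with every iteration" idea in Python 3, where urllib2.Request became urllib.request.Request and the form data is passed to the constructor. The endpoint URL is a placeholder, since the real one is not shown in the question, and build_requests is a made-up helper name.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint; the real service URL is not given in the question.
SERVICE_URL = "https://example.invalid/term-extraction"


def build_requests(appid, comments):
    """Build one independent POST request per comment text."""
    requests_to_send = []
    for text in comments:
        form = urllib.parse.urlencode([("appid", appid), ("context", text)])
        # A new Request per iteration, so no state carries over between calls.
        request = urllib.request.Request(SERVICE_URL, data=form.encode("utf-8"))
        requests_to_send.append(request)
    return requests_to_send
```

Each request can then be sent with urllib.request.urlopen(request).read(), keeping the time.sleep(1) between calls as in the original loop.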
