How to prevent repeated processing in Flask - python

I have a site in production using Flask, Gunicorn, and MongoEngine, and I'm facing a duplicate data processing error caused by too many requests per second on certain routes.
Example code:
@app.route('/proccess_data/<u_id>')
def procces_dt_(u_id):
    check_exists = db_collection.find_one({'u_id': u_id})
    if not check_exists:
        db_collection.insert_one({
            'u_id': u_id,
            'key': 'value'
        })
        return 'OK'
    else:
        return 'has already been processed', 403
Note the check I do against the database before inserting a new document. This seems correct to me.
But when I tested with a few requests per second, the result was multiple identical documents being inserted.
To be clear, it would not be useful for me to simply create a unique index on the u_id field.
I've already tried a few things I found while researching, but nothing works, and I haven't found much material related to this.
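One direction that avoids the read-then-write race entirely is to let MongoDB perform the check and the insert as a single atomic operation. Below is a minimal, untested sketch assuming a PyMongo-style collection object like the one in the example above; the $setOnInsert payload simply mirrors the fields from the question.
@app.route('/proccess_data/<u_id>')
def procces_dt_(u_id):
    # update_one with upsert=True is atomic on the MongoDB server: only the first
    # concurrent request creates the document, all later ones just match the filter.
    result = db_collection.update_one(
        {'u_id': u_id},
        {'$setOnInsert': {'u_id': u_id, 'key': 'value'}},
        upsert=True
    )
    if result.upserted_id is not None:
        return 'OK'
    return 'has already been processed', 403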

Related

Imgur API - How do I retrieve all favorites without pagination?

According to the Imgur Docs, the "GET Account Favorites" API call takes optional arguments for pagination, implying that all objects are returned without it.
However, when I use the following code snippet (the application has been registered and OAuth has already been performed against my account for testing), I get only the first 30 JSON objects. In the snippet below I already have an access_token for an authorized user and can retrieve data for that username, but the length of the returned list is always just 30 items.
import requests

username = token['username']
bearer_headers = {
    'Authorization': 'Bearer ' + token['access_token']
}
fav_url = 'https://api.imgur.com/3/account/' + username + '/' + 'favorites'
r = requests.get(fav_url, headers=bearer_headers)
r_json = r.json()
favorites = r_json['data']
print(len(favorites))
print(favorites)
The requests response returns a dictionary with three keys: status (the HTTP status code), success (true or false), and data, of which the value is a list of dictionaries (one per favorited item).
I'm trying to retrieve this without pagination so I can extract specific metadata values into a Pandas dataframe (id, post date, etc).
I originally thought this was a Pandas display problem in a Jupyter notebook, but tracked it back to the API only returning the newest 30 list items, despite the docs indicating otherwise. If I append an arbitrary page number to the URL (e.g., "/favorites/1"), it returns the 30 items appropriate to that page, but there doesn't seem to be an option to get all items, or to retrieve a count of the total items or number of pages in advance.
What am I missing?
Postscript: It appears that none of the URIs work without pagination, e.g., get account images, get gallery submissions, etc. Anything with an optional "/{{page}}" parameter defaults to the first page if none is specified. So I guess the larger question is: does the Imgur API even support non-paginated data, and how is that accessed?
Paginated data is usually used when the possible size of the response can be arbitrarily large. I would be surprised if a major service like Imgur had an API that didn't work this way.
As you have found, the page attribute may be optional, and if you don't provide it, you get the first page as your response.
If you want to get more than the first page, you will need to loop over the page number:
data = []
page = 0
while block := connection.get(page=page):
    data.append(block)
    page += 1
This assumes Python 3.8+ because of the := assignment expression. If you are on an older version, you'll need to assign block inside the loop body, but the same idea applies.
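Applied to the Imgur endpoint from the question, the loop might look like the sketch below. This is untested and assumes, as observed in the question, that a page number can be appended to the favorites URL and that an empty data list marks the last page.
import requests

def get_all_favorites(username, access_token):
    headers = {'Authorization': 'Bearer ' + access_token}
    favorites = []
    page = 0
    while True:
        url = 'https://api.imgur.com/3/account/{}/favorites/{}'.format(username, page)
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        batch = r.json()['data']
        if not batch:  # an empty page means everything has been fetched
            break
        favorites.extend(batch)
        page += 1
    return favorites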

Elasticsearch DSL Python issue

I have been using the Elasticsearch DSL Python package to query my Elasticsearch database. The querying method is very intuitive, but I'm having issues retrieving the documents. This is what I have tried:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch(hosts=[{"host": 'xyz', "port": 9200}], timeout=400)
s = Search(using=es, index="xyz-*").query("match_all")
response = s.execute()
for hit in response:
    print(hit.title)
The error I get:
AttributeError: 'Hit' object has no attribute 'title'
I googled the error and found another SO question: How to access the response object using elasticsearch DSL for python
The solution there mentions:
for hit in response:
    print(hit.doc.firstColumnName)
Unfortunately, I had the same issue again with 'doc'. What is the correct way to access my documents?
Any help would really be appreciated!
I've run into the same issue and found different versions of this answer; it seems to depend on the version of the elasticsearch-dsl library you're using. You might explore the response object and its sub-objects. For instance, using version 5.3.0, I see the expected data with the loop below.
for hit in RESPONSE.hits._l_:
    print(hit)
or
for hit in RESPONSE.hits.hits:
    print(hit)
NOTE these are limited to 10 data elements, which is Elasticsearch's default result size, so they don't contain everything.
print(len(RESPONSE.hits.hits))
10
print(len(RESPONSE.hits._l_))
10
This doesn't match the total number of hits I see when I print it with print('Total {} hits found.\n'.format(RESPONSE.hits.total))
Good luck!
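If that 10-element cap is just the default result size, one way around it (a sketch, not part of the original answer, reusing the es client from the question) is to slice the Search object before executing, which sets the from/size parameters on the request:
s = Search(using=es, index="xyz-*").query("match_all")
s = s[0:1000]  # request up to 1000 hits instead of the default 10
response = s.execute()
print(len(response.hits.hits))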
From version 6 onwards the response does not return your populated Document class anymore; your fields are just an AttrDict, which is basically a dictionary.
To solve this you need a Document class representing the document you want to parse. Then you parse the hit dictionary with your document class using the .from_es() method.
Like I answered here.
https://stackoverflow.com/a/64169419/5144029
Also have a look at the Document class here
https://elasticsearch-dsl.readthedocs.io/en/7.3.0/persistence.html
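A minimal sketch of that approach, reusing the es client from the question; the document class and its field names are placeholders (not taken from the question) and need to match what your index actually stores:
from elasticsearch_dsl import Document, Search, Text

class MyDoc(Document):
    title = Text()  # placeholder field name

    class Index:
        name = 'xyz-*'

response = Search(using=es, index='xyz-*').query('match_all').execute()
for raw_hit in response.hits.hits:
    doc = MyDoc.from_es(raw_hit.to_dict())  # rebuild a typed document from the raw hit
    print(doc.title)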

How to disable query cache?

First of all, sorry that the question title isn't 100% clear.
It is easier to explain with a few lines of code:
query = {...}
while True:
    elastic_response = elastic_client.search(elastic_index, body=query, request_cache=False)
    if elastic_response["hits"]["total"] == 0:
        break
    else:
        for doc in elastic_response["hits"]["hits"]:
            print("delete {}".format(doc["_id"]))
            elastic_client.delete(index=elastic_index, doc_type=doc["_type"], id=doc["_id"])
I run a search, delete all the returned docs, and then search again to get the next batch.
BUT the search query gives me the same docs! And this results in a 404 exception on delete. It has to be some kind of cache, but I haven't found anything; "request_cache" doesn't help.
I could probably refactor this code to use a bulk delete, but I want to understand what is wrong here.
P.S. I'm using the official Python client.
If using a sleep() after the deletes makes the documents go away, then it's not about caching. It's about the refresh_interval and the near-real-time nature of Elasticsearch.
So, call _refresh after your code leaves the for loop. Also, don't delete documents one by one; instead, create a _bulk request that deletes your documents in batches, sized according to how many there are.
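A rough, untested sketch of that idea with the official Python client, reusing the index and query names from the question (the bulk helper and the refresh call are part of the standard client API):
from elasticsearch import helpers

while True:
    resp = elastic_client.search(index=elastic_index, body=query, request_cache=False)
    hits = resp["hits"]["hits"]
    if not hits:
        break
    actions = [
        {"_op_type": "delete", "_index": hit["_index"], "_type": hit["_type"], "_id": hit["_id"]}
        for hit in hits
    ]
    helpers.bulk(elastic_client, actions)  # one bulk request instead of one delete per document
    elastic_client.indices.refresh(index=elastic_index)  # make the deletes visible to the next search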

Python Requests POST to form records incorrect payload (checkboxes)

I am having quite a bit of trouble getting the correct form data saved to a server via POST with the Requests (2.8.1) module.
I have previous code which does exactly what I want it to do: it encodes a bunch of key:value pairs into the correct header:value payload dict format and successfully POSTs to the URI. I get a 200 response (what I'm looking for) and everything is great.
This is a section of the OLD payload encoding function, with a ton of key:value pairs omitted for brevity.
Note: the checkbox value set could be any sequence of numbers between 1 and 25; I just wrote it as
item in range(1,5)
to illustrate that the list is made up of int values, i.e. [ "", 1, 2, 3, 4, 5, ...] or [ "", 2, 7, 5, 1, 25, ...] etc.
checkboxList = ["",]
for item in range(1,5):
    checkboxList.append(item)
payload['checkbox[ids][]'] = checkboxList
...
response = requests.post(data_url, data=payload)
>> 200 OK!
Here is a print of what the payload dict (checkboxes) looks like before it's sent to the server:
{... "checkbox[ids][]" : [ "", 2, 17, 20, 5], ...}
And when I look on the page with a browser, all the payload information has been correctly recorded (omitted above) AND the checkboxes (shown above) are correct!
Originally, the checkbox values came from an Excel file, as did the rest of the information that was put into the payload before being POSTed to the server. However, now I'm retrieving the information from an SQLite db.
Below is the NEW code that records the checkboxes incorrectly. I should note: I do not have access to the server, so I cannot easily tell whether it's a server issue, but let's assume it's not the server's fault. I've had this issue previously, but I got it to work with the code above. However, now that I've started storing the values I need in a db, I cannot get the correct checkboxes recorded by the server.
This is what the data from the db column looks like:
12-5-1-22-4
(... I know this isn't great practice for DB mgmt, but I assume this isn't why the POST is recording the wrong data, and I wanted this question to be as representative of my code as possible.)
checkList = checkboxesFromDB.split('-')
payload['checkbox[ids][]'] = checkList
...
response = requests.post(data_url, data=payload)
>> 200 OK!
When I look at the site in a browser, it records the checkboxes incorrectly. I should note that 3 checkboxes are selected no matter what I pass to payload['checkbox[ids][]'].
It's ALWAYS the same 3 incorrect checkboxes, even if I completely omit checkbox[ids][] from the payload dict. Knowing that, we could assume it's a server issue. However, the nearly EXACT code from above works (when I grab the info from an Excel file).
I've tried the following (with only one value as a test) without getting the correct checkboxes recorded by the server:
payload['checkbox[ids][]'] = '1'
payload['checkbox[ids][]'] = 1
payload['checkbox[ids][]'] = [1]
payload['checkbox[ids][]'] = ["",1]
payload['checkbox[ids][]'] = [1,""]
When uploading images to the same server, I had an encoding issue when retrieving the image BLOB from the db and trying to pass the buffer object directly to Requests as a file, but I fixed that with cStringIO encoding. (It took me forever, as I'm really new to programming and still unsure of syntax, let alone how to handle this sort of stuff...) I thought I might be having a similar encoding issue here, but with the testing and research I've done, I can't tell either way, as I feel like I'm a bit over my head.
I apologize if this is completely NOOB, but I've done extensive research and tried so many different things I could think of: passing strings, lists, dicts, forcing encoding of lists as utf-8.
The main reason I'm so perplexed is that my original code WORKS, and my new code is nearly identical but doesn't. The only real difference I can think of is that my information now comes from an SQLite db (this particular checkbox column is of TEXT type).
Can anyone help me, or point me in a new direction I haven't thought of or don't know about?
I went through all the payload pairs and found that it was an issue with the HTML.
I was saving HTML in my SQLite db (via BeautifulSoup, without prettifying it) as TEXT, then retrieving it and sending it as a string. This was throwing off the server response.
I have since changed that SQL column's type to VARCHAR (as is best for my use) and prettify the markup with foo = bar.prettify(formatter="html") before saving it to the db. Now, when I retrieve the value and pass it to the payload, everything works as it should.
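In sketch form, the save path might look something like this; the BeautifulSoup and sqlite3 calls are standard, but the table, column, and file names are made up for illustration:
import sqlite3
from bs4 import BeautifulSoup

raw_html = "<div><p>example &amp; test</p></div>"  # stand-in for the scraped markup
soup = BeautifulSoup(raw_html, "html.parser")
clean_html = soup.prettify(formatter="html")  # normalize entities/markup before storing

conn = sqlite3.connect("records.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, page_html VARCHAR)")
conn.execute("INSERT INTO records (page_html) VALUES (?)", (clean_html,))
conn.commit()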

Flask template streaming with Jinja

I have a Flask application. On a particular view, I show a table with about 100k rows in total. Understandably, the page takes a long time to load, and I'm looking for ways to improve it. So far I've determined that the database query returns results fairly quickly, so I think the problem lies in rendering the actual page. I found this page on streaming and am trying to work with that, but I keep running into problems. I've tried the stream_template solution provided there with this code:
@app.route('/thing/matches', methods=['GET', 'POST'])
@roles_accepted('admin', 'team')
def r_matches():
    matches = Match.query.filter(
        Match.name == g.name).order_by(Match.name).all()
    return Response(stream_template('/retailer/matches.html',
                                    dashboard_title=g.name,
                                    match_show_option=True,
                                    match_form=form,
                                    matches=matches))

def stream_template(template_name, **context):
    app.update_template_context(context)
    t = app.jinja_env.get_template(template_name)
    rv = t.stream(context)
    rv.enable_buffering(5)
    return rv
The Match query is the one that returns 100k+ items. However, whenever I run this, the page just comes up blank. I've also tried the solution of streaming the data to JSON and loading it via AJAX, but nothing seems to end up in the JSON file either! Here's what that solution looks like:
@app.route('/large.json')
def generate_large_json():
    def generate():
        app.logger.info("Generating JSON")
        matches = Product.query.join(Category).filter(
            Product.retailer == g.retailer,
            Product.match != None).order_by(Product.name)
        for match in matches:
            yield json.dumps(match)
        app.logger.info("Sending file response")
    return Response(stream_with_context(generate()))
Another solution I was looking at was pagination. That works well, except I need to be able to sort the entire dataset by its column headers, and I couldn't find a way to do that without rendering the whole dataset in the table and then using jQuery for sorting/pagination.
The file I get from /large.json is always empty. Please help, or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.
The problem in both cases is almost certainly that you are hanging on building the 100K+ Match items and holding them all in memory. You will want to stream the results from the DB as well, using yield_per. However, only Postgres+psycopg2 support the necessary stream_results option (here's a way to do it with MySQL):
matches = Match.query.filter(
    Match.name == g.name).order_by(Match.name).yield_per(10)
# Stream ten results at a time
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
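A rough sketch of that approach, assuming Flask-SQLAlchemy's paginate() and reusing the route and template names from the question (the per_page value is arbitrary):
@app.route('/thing/matches')
@roles_accepted('admin', 'team')
def r_matches():
    page = request.args.get('page', 1, type=int)
    pagination = (Match.query
                  .filter(Match.name == g.name)
                  .order_by(Match.name)
                  .paginate(page=page, per_page=100, error_out=False))
    return render_template('/retailer/matches.html',
                           matches=pagination.items,  # only this page's rows
                           pagination=pagination)  # exposes .pages, .next_num, .prev_num for links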
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half
