Flask template streaming with Jinja - python

I have a Flask application. On a particular view, I show a table with about 100k rows in total. Understandably, the page takes a long time to load, and I'm looking for ways to improve it. So far I've determined that the database query itself returns fairly quickly, so I think the problem lies in rendering the actual page. I've found this page on streaming and am trying to work with that, but keep running into problems. I've tried the stream_template solution provided there with this code:
@app.route('/thing/matches', methods=['GET', 'POST'])
@roles_accepted('admin', 'team')
def r_matches():
    matches = Match.query.filter(
        Match.name == g.name).order_by(Match.name).all()
    return Response(stream_template('/retailer/matches.html',
                                    dashboard_title=g.name,
                                    match_show_option=True,
                                    match_form=form,
                                    matches=matches))

def stream_template(template_name, **context):
    app.update_template_context(context)
    t = app.jinja_env.get_template(template_name)
    rv = t.stream(context)
    rv.enable_buffering(5)
    return rv
The Match query is the one that returns 100k+ items. However, whenever I run this the page just shows up blank. I've also tried the solution that streams the data as JSON and loads it via AJAX, but nothing seems to end up in the JSON file either! Here's what that solution looks like:
@app.route('/large.json')
def generate_large_json():
    def generate():
        app.logger.info("Generating JSON")
        matches = Product.query.join(Category).filter(
            Product.retailer == g.retailer,
            Product.match != None).order_by(Product.name)
        for match in matches:
            yield json.dumps(match)
    app.logger.info("Sending file response")
    return Response(stream_with_context(generate()))
Another solution I was looking at was pagination. That works well, except I need to be able to sort the entire dataset by its column headers, and I couldn't find a way to do that without rendering the whole dataset in the table and then using jQuery for sorting/pagination.
The file I get by going to /large.json is always empty. Please help or recommend another way to display such a large data set!
Edit: I got the generate() part to work and updated the code.

The problem in both cases is almost certainly that you are hanging on building 100k+ Match objects and storing them in memory. You will want to stream the results from the DB as well, using yield_per. However, only PostgreSQL+psycopg2 supports the necessary stream_results argument (here's a way to do it with MySQL):
matches = Match.query.filter(
    Match.name == g.name).order_by(Match.name).yield_per(10)
# Stream ten results at a time
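A hedged sketch of how yield_per might slot into the stream_template view from the question (it reuses the stream_template helper above; non-essential template variables are omitted, so treat it as a sketch rather than a drop-in fix):
@app.route('/thing/matches', methods=['GET', 'POST'])
def r_matches():
    # stream rows lazily from the DB into the streamed template
    matches = (Match.query
               .filter(Match.name == g.name)
               .order_by(Match.name)
               .yield_per(10))
    return Response(stream_template('/retailer/matches.html',
                                    dashboard_title=g.name,
                                    matches=matches))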
An alternative
If you are using Flask-SQLAlchemy you can make use of its Pagination class to paginate your query server-side and not load all 100K+ entries into the browser. This has the added advantage of not requiring the browser to manage all of the DOM entries (assuming you are doing the HTML streaming option).
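A minimal sketch of a paginated version of the view, assuming Flask-SQLAlchemy's paginate() helper and the usual Flask imports (request, render_template); the per_page value and the page query argument are illustrative:
@app.route('/thing/matches')
def r_matches():
    page = request.args.get('page', 1, type=int)
    pagination = (Match.query
                  .filter(Match.name == g.name)
                  .order_by(Match.name)
                  .paginate(page=page, per_page=100, error_out=False))
    return render_template('/retailer/matches.html',
                           matches=pagination.items,
                           pagination=pagination)
The Pagination object exposes has_prev, has_next and the page numbers, so the template can render prev/next links without ever loading the full result set.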
See also
SQLAlchemy: Scan huge tables using ORM?
How to Use SQLAlchemy Magic to Cut Peak Memory and Server Costs in Half

Related

Is there a way to improve the load of nosql db from Streamlit app?

I'm writing a Streamlit app where I read data from a NoSQL database. I need to fetch all the values in my collection (about 26,000 documents), which I do with a pymongo query when the app runs.
I've tried this with a collection of 2,000+ rows and have no problems, but when I switch to the larger collection mentioned above, Streamlit re-runs in a loop and the web socket raises an error.
Is there a way to load a huge amount of data from the database and display it in the Streamlit app as a dataframe without raising an exception?
I have increased the message size limit by modifying the config setting for server.maxMessageSize, but it still doesn't work.
It might be helpful if you could include at least the error message itself to help narrow down the specific issue. From what I can tell, though, the overarching issue is that you're trying to send more data to Streamlit than its max message size permits. I'm assuming that by increasing the message size you mean you've modified the config setting for server.maxMessageSize per the Streamlit configuration guide, but if that is not the case then you should give that a try.
As you've got a huge amount of data, it might be best to give some form of pagination a try. That is: instead of sending the entire dataframe to Streamlit, you would instead send a section of it (a "page") and provide some way of switching between "pages", such as in the form of "previous page"/"next page" buttons. As an example, assuming you're loading your data in as a Pandas dataframe and have Streamlit imported as st, you could try encapsulating pagination into a function like this:
def paginate_dataframe(df, page_size: int):
    # Determine the maximum number of pages
    from math import ceil
    n_pages = ceil(len(df) / page_size)
    last_page = n_pages - 1

    # Get the current page, initializing the state value if necessary
    current_page = st.session_state.setdefault("current_page", 0)

    # Callback functions for the previous/next buttons so they can adjust the current page
    def decrement_page():
        if st.session_state.current_page > 0:
            st.session_state.current_page -= 1

    def increment_page():
        if st.session_state.current_page < last_page:
            st.session_state.current_page += 1

    # Create previous/next buttons, disabled on the first or last page, respectively
    button_columns = st.columns(2)
    with button_columns[0]:
        st.button(
            label="previous page",
            on_click=decrement_page,
            disabled=(current_page <= 0),
        )
    with button_columns[1]:
        st.button(
            label="next page",
            on_click=increment_page,
            disabled=(current_page >= last_page),
        )

    # Show only the slice of the dataframe that belongs to the current page
    page_start = current_page * page_size
    st.dataframe(df.iloc[page_start: page_start + page_size])

paginate_dataframe(df, page_size=5)

Why does ArangoDB (using Python-Arango) return ERR 1600 ERROR_CURSOR_NOT_FOUND?

The problem
I iterate over an entire vertex collection, e.g. journals, and use it to create edges, author, from a person to the given journal.
I use python-arango and the code is something like:
for journal in journals.all():
    create_author_edge(journal)
I have a relatively small dataset, and the journals collection has only about 1,300 documents. However, this is more than 1,000, which is the batch size in the web interface, though I don't know whether that is relevant.
The problem is that it raises a CursorNextError, and returns HTTP 404 and ERR 1600 from the database, which is the ERROR_CURSOR_NOT_FOUND error:
Will be raised when a cursor is requested via its id but a cursor with that id cannot be found.
Insights into the cause
From ArangoDB Cursor Timeout, and from this issue, I suspect that it's because the cursor's TTL has expired in the database, and in the python stacktrace something like this is seen:
# Part of the stacktrace in the error:
(...)
    if not cursor.has_more():
        raise StopIteration
    cursor.fetch()  # <---- error raised here
(...)
If I iterate over the entire collection quickly, e.g. if I just do print(len(journals.all())), it outputs "1361" with no errors.
When I replace the journals.all() with AQL, and increase the TTL parameter, it works without errors:
for journal in db.aql.execute("FOR j IN journals RETURN j", ttl=3600):
    create_author_edge(journal)
However, without the ttl parameter, the AQL approach gives the same error as using journals.all().
More information
A last piece of information is that I'm running this on my personal laptop when the error is raised. On my work computer, the same code was used to create the graph and populate it with the same data, and no errors were raised there. Because I'm on holiday I don't have access to my work computer to compare versions, but both systems were installed during the summer, so there's a good chance the versions are the same.
The question
I don't know if this is an issue with python-arango or with ArangoDB. Because there is no problem when the TTL is increased, I believe it could indicate an issue with ArangoDB rather than with the Python driver, but I cannot know.
(I've added a feature request to add ttl-param to the .all()-method here.)
Any insights into why this is happening?
I don't have the rep to create the tag "python-arango", so it would be great if someone would create it and tag my question.
Inside the server, the .all() call is executed as a simple query. As discussed on the referenced GitHub issue, simple queries don't support the TTL parameter and won't get it.
The preferred solution here is to use an AQL query on the client, so that you can specify the TTL parameter.
In general you should refrain from pulling all documents from the database at once, since this may introduce other scaling issues. You should use proper AQL with FILTER statements backed by indices (use explain() to revalidate) to fetch only the documents you require.
If you need to iterate over all documents in the database, use paging. This is usually best implemented by combining a range FILTER with a LIMIT clause:
FOR x IN docs
  FILTER x.offsetteableAttribute > @lastDocumentWithThisID
  LIMIT 200
  RETURN x
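For example, a hedged sketch of that pattern with python-arango, paging on _key (the attribute and batch size are purely illustrative; any indexed, sortable attribute works the same way):
last_key = ""
while True:
    cursor = db.aql.execute(
        "FOR j IN journals FILTER j._key > @last SORT j._key LIMIT 200 RETURN j",
        bind_vars={"last": last_key},
        ttl=300,
    )
    batch = list(cursor)
    if not batch:
        break
    for journal in batch:
        create_author_edge(journal)
    last_key = batch[-1]["_key"]  # continue after the last document of this page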
So here is how I did it with pyArango; the **moreArgs parameter makes it easy to do.
Looking at the source, you can see the docstring tells you what to do:
def AQLQuery(self, query, batchSize=100, rawResults=False, bindVars=None, options=None,
             count=False, fullCount=False, json_encoder=None, **moreArgs):
    """Set rawResults = True if you want the query to return dictionnaries instead of Document objects.
    You can use **moreArgs to pass more arguments supported by the api, such as ttl=60 (time to live)"""
from pyArango.connection import Connection

conn = Connection(username=usr, password=pwd, arangoURL=url)  # set these for your server
db = conn['dbName']  # set this to the name of your database
aql = "FOR journal IN journals RETURN journal"
results = db.AQLQuery(aql, ttl=300)
for journal in results:
    create_author_edge(journal)
That's all you need to do!

How to disable query cache?

First of all, sorry that the question title isn't 100% clear. It is easier to explain with a few lines of code:
query = {...}

while True:
    elastic_response = elastic_client.search(elastic_index, body=query, request_cache=False)
    if elastic_response["hits"]["total"] == 0:
        break
    else:
        for doc in elastic_response["hits"]["hits"]:
            print("delete {}".format(doc["_id"]))
            elastic_client.delete(index=elastic_index, doc_type=doc["_type"], id=doc["_id"])
I make a search, delete all the returned docs, and then search again to get the next batch.
But the search query gives me the same docs, and this results in a 404 exception on delete. It has to be some kind of cache, but I haven't found anything; request_cache doesn't help.
I can probably refactor this code to use a bulk delete, but I want to understand what is wrong here.
P.S. I'm using the official Python client.
If using a sleep() after the deletes makes the documents go away, then it's not about caching. It's about the refresh_interval and the near-real-time nature of Elasticsearch.
So, call _refresh after your code leaves the for loop. Also, don't delete documents one by one; instead create a _bulk request that deletes your documents in batches, depending on how many there are.
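A hedged sketch of that approach with the official Python client and its bulk helper (variable names follow the question; exact call signatures vary a bit between client versions):
from elasticsearch.helpers import bulk

response = elastic_client.search(index=elastic_index, body=query)
actions = (
    {"_op_type": "delete", "_index": hit["_index"], "_id": hit["_id"]}
    for hit in response["hits"]["hits"]
)
bulk(elastic_client, actions)                        # one _bulk request instead of N single deletes
elastic_client.indices.refresh(index=elastic_index)  # make the deletes visible to the next search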

Unable to iteratively call yahoo's term extractor api using python

I am trying to loop through some 50-odd files in a directory. Each file has some text for which I am trying to find keywords using the Yahoo Term Extractor. I am able to extract the text from each file, but I am not able to iteratively call the API using that text as input; only the keywords for the first file are displayed.
Here is my code snippet:
In the 'comments' list, I have extracted and stored the text from each file.
for c in comments:
    print "building query"
    dataDict = [('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Well, I don't know anything about the Yahoo Term Extractor, but I'd presume that your call to request.add_data(queryData) simply tacks on another data set with each iteration of your loop, and that the call to OPENER.open(request).read() then only processes the results of the first data set. So either your request object can only hold one query, or your OPENER object's inner workings can only process one query; it's as simple as that.
Actually a third reason comes to mind now that I read the documentation provided at your link, and this is probably the true one:
RATE LIMITS
The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting.
So it would make sense that the API would limit your usage to one query at a time, and not allow you to flood a bunch of queries in a single request.
In any event, I'd assume you could fix your problem in a "naive" way by having many request variables instead of just one, or maybe just creating a new request with every iteration of your loop. If you're not worried about storing your results, and just trying to debug, you could try:
for c in comments:
    print "building query"
    dataDict = [('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request = urllib2.Request()  # I don't know how to initialize this variable, do it yourself
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Again, I don't know about the Yahoo Term Extractor (nor do I really have time to research it) so there may very well be a better, more native way to do this. If you post more details of your code (i.e. what classes are the request and OPENER objects coming from) then I might be able to elaborate on this.

How to implement full text search in Django?

I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.
See:
if request.method == "POST":
form = SearchForm(request.POST)
if form.is_valid():
posts = Post.objects.all()
for string in form.cleaned_data['query'].split():
posts = posts.filter(
Q(title__icontains=string) |
Q(text__icontains=string) |
Q(tags__name__exact=string)
)
return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want to combine each searched word with a logical AND but with a logical OR? How would I do that? Is there a way to do it with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full-text search like this, or would you recommend using a search engine like Solr, Whoosh or Xapian? What are their benefits?
I suggest you adopt a search engine.
We've used Haystack, a modular search application for Django supporting many search engines (Solr, Xapian, Whoosh, etc.).
Advantages:
Faster
perform search queries even without querying the database.
Highlight searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc...
Disadvantages:
Search Indexes can grow in size pretty fast
One of the best search engines (Solr) runs as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
Actually, the query you have posted does use OR rather than AND - you're using | to separate the Q objects. AND would be &.
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
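If you do want to OR together an arbitrary list of search terms using only the ORM, a minimal sketch (reusing the field names from the question) looks like this:
from functools import reduce
import operator
from django.db.models import Q

terms = form.cleaned_data['query'].split()
term_filters = [
    Q(title__icontains=term) | Q(text__icontains=term) | Q(tags__name__exact=term)
    for term in terms
]
# OR all the per-term filters together; use operator.and_ to get the original AND behaviour
posts = Post.objects.filter(reduce(operator.or_, term_filters))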
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full-text search engine (or whatever you call it), the text (the words) is indexed every time you insert new records, so queries will be a lot faster, especially as your database grows.
Solr is very easy to set up and integrate with Django, and Haystack makes it even simpler.
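For reference, a minimal sketch of what a Haystack index for the Post model from the question might look like (the app path and field names are illustrative, not a drop-in):
# search_indexes.py (illustrative django-haystack sketch)
from haystack import indexes
from blog.models import Post  # hypothetical app path

class PostIndex(indexes.SearchIndex, indexes.Indexable):
    # the main document field, built from a template that concatenates title, text and tags
    text = indexes.CharField(document=True, use_template=True)
    date = indexes.DateTimeField(model_attr='date')

    def get_model(self):
        return Post

    def index_queryset(self, using=None):
        # index everything; narrow this down if you only want published posts
        return self.get_model().objects.all()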
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and update the index eventually.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google which pages changed and Google will do all the hard work (indexing, parsing the queries, etc.). On top of that, most people are used to using Google for search, plus it will keep your site current in global Google searches, too.
I think full-text search at the application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, it might be more affordable to put some time into a custom full-text search rather than installing an application to perform the search for you. An external application adds more dependencies, maintenance and extra effort when storing data. By building the search yourself you can also add nice custom features: for example, if a query exactly matches one title you can send the user to that page instead of showing the results, or you can allow title: or author: prefixes on keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex

class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items with heaviest weight first
        res = []
        while len(self.data) != 0:
            nominated_weight = 0
            for item, weight in self.data.iteritems():
                if weight > nominated_weight:
                    nominated = item
                    nominated_weight = weight
            self.data.pop(nominated)
            res.append(nominated)
            if len(res) == max_len:
                return res
        return res

    def append(self, weight, item):
        if item in self.data:
            self.data[item] += weight
        else:
            self.data[item] = weight

def search(searchtext):
    candidates = WeightedGroup()

    for arg in shlex.split(searchtext):  # shlex understands quotes
        # Search TITLE
        # order by date so we get the most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')

        arg_hits = query.count()  # count is cheap
        if arg_hits > 1000:
            continue  # skip keywords which have too many hits

        # Each of these is expensive as it transfers data
        # from the db and builds a Python object,
        for post in query[:50]:  # so we limit it to 50 for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)

    # TODO add searches for other areas
    # Weight might also be adjusted with the number of hits within the text,
    # or perhaps you can find other metrics to value a post higher,
    # like number of views

    # candidates can contain a lot of stuff now, show the most relevant only
    sorted_result = Post.objects.filter(post_id__in=candidates.list(20))
