This is the query I have in my views.py:
# get the persistent Facebook graph for the current user
parse = get_persistent_graph(request)
#guys
male_pic = parse.fql('SELECT name,uid,education FROM user WHERE sex="male" AND uid IN (SELECT uid2 FROM friend WHERE uid1 = me())')
This query currently takes approximately 10 seconds to load, and I have 800+ friends.
Is it possible to only query this once, update when needed and save to a variable to use instead of having to query every time the page is loaded/refreshed?
Some possible solutions that I can think of are:
Saving to the database - IMO that doesn't seem easily scalable if I save every query from every user of this application
Some function I have no knowledge of
Improved and more efficient query request
Could someone please point me toward the most efficient approach? Hoping I can lower 10-second requests to < 1 second! Thanks!
Use a caching system for that (like memcached).
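For example, here is a rough sketch using Django's cache framework backed by memcached; the per-user cache key, the wrapper function name and the 15-minute timeout are illustrative assumptions, not part of the original code:

from django.core.cache import cache  # assumes a memcached backend configured in settings.CACHES

def get_male_friends(request):
    graph = get_persistent_graph(request)
    cache_key = 'male_friends_%s' % request.user.pk  # hypothetical per-user key
    male_pic = cache.get(cache_key)
    if male_pic is None:
        # only hit the Facebook API on a cache miss
        male_pic = graph.fql('SELECT name,uid,education FROM user '
                             'WHERE sex="male" AND uid IN '
                             '(SELECT uid2 FROM friend WHERE uid1 = me())')
        cache.set(cache_key, male_pic, 60 * 15)  # keep the result for 15 minutes
    return male_pic

Subsequent page loads within the timeout read the cached result instead of calling FQL, which is where the 10-second cost is.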
I am trying to get the number of items in a table from DynamoDB.
Code
def urlfn():
    if request.method == 'GET':
        print("GET request processing")
        return render_template('index.html', count=table.item_count)
But I am not getting the real count. I found that there is a 6-hour delay in getting the real count. Is there any way to get the real count of items in a table?
Assuming in your code above that table is a service resource already defined, you can use:
len(table.scan()['Items'])
This will give you an up-to-date count of the items in your table. BUT it reads every single item in your table - for significantly large tables this can take a long time. AND it uses read capacity on your table to do so. So, for most practical purposes it really isn't a very good way to do it.
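If you do go the scan route, a slightly cheaper sketch is to ask DynamoDB for counts only and follow the pagination keys, since a single scan call stops after 1 MB of data (this still consumes read capacity proportional to the table size):

def scan_count(table):
    # Sums the per-page Count values; Select='COUNT' avoids returning the items themselves.
    total = 0
    kwargs = {'Select': 'COUNT'}
    while True:
        response = table.scan(**kwargs)
        total += response['Count']
        if 'LastEvaluatedKey' not in response:
            return total
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']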
Depending on your use case here there are a few other options:
Add a meta item that is updated every time a new document is added to Dynamo. This is just a document with whatever hash key / sort key combination you want, with an attribute of "value" that you add 1 to every time you add a new item to the database (a sketch of this follows after this list).
Forget about using Dynamo. Sorry if that sounds harsh, but DynamoDB is a NoSQL database, and attempting to use it in the same manner as a traditional relational database system is folly. The number of 'rows' is not something Dynamo is designed to report, because that is outside its use-case scope. There are no rows in Dynamo - there are documents, and those documents are partitioned, and you access small chunks of them at a time - meaning that the back-end architecture does not lend itself to knowing what the entire system contains at any given time (hence the 6-hour delay).
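A rough sketch of the meta-item idea from the first option, assuming the boto3 table resource from the question and a partition key named 'pk' (adjust the key names to your schema); you would call the increment alongside every successful put_item:

def increment_item_counter(table):
    # Atomically adds 1 to the 'item_total' attribute of a dedicated meta item.
    table.update_item(
        Key={'pk': 'ITEM_COUNT'},                # hypothetical meta-item key
        UpdateExpression='ADD item_total :one',
        ExpressionAttributeValues={':one': 1},
    )

def read_item_counter(table):
    # Reading the counter back is a single cheap get_item call.
    return table.get_item(Key={'pk': 'ITEM_COUNT'})['Item']['item_total']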
After researching for three days and playing with Redis and Celery, I am no longer sure what the right solution to my problem is.
It's a simple problem. I have a simple Flask app returning the data of a MySQL query. But I don't want to query the database for every request made, as there might be 100 requests in a second. I want to set up a daemon that independently queries my database every five seconds; if someone makes a request it should return the data of the previous query, and when those 5 seconds pass it will return the data from the latest query. All users receive the same data. Is Celery the solution?
The easiest way is to use Flask-Caching.
Just set a cache timeout of 5 seconds on your view and it will return a cached response containing the result of the query made the first time, for every other request in the next 5 seconds. When the time is up, the first request will regenerate the cache by running the query and the rest of your view logic.
If your view function takes arguments, use memoization instead of the cache decorator so that caching uses your arguments to generate the cache key. For example, if you want to return a page's details and you don't use memoization, you will return the same page details to all your users, regardless of the id / slug in the arguments.
The Flask-Caching documentation explains everything better than I can.
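A minimal sketch of what that looks like (run_mysql_query is a hypothetical stand-in for your existing query function, and SimpleCache is just the easiest backend for a single process; use Redis or memcached in production):

from flask import Flask, jsonify
from flask_caching import Cache

app = Flask(__name__)
cache = Cache(app, config={'CACHE_TYPE': 'SimpleCache'})

@app.route('/data')
@cache.cached(timeout=5)      # every request within 5 seconds gets the same cached response
def data():
    rows = run_mysql_query()  # hypothetical helper wrapping your MySQL query
    return jsonify(rows)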
I'm working on a project that allows users to enter SQL queries with parameters, that SQL query will be executed over a period of time they decide (say every 2 hours for 6 months) and then get the results back to their email address.
They'll get it in the form of an HTML-email message, so what the system basically does is run the queries, and generate HTML that is then sent to the user.
I also want to save those results, so that a user can go on our website and look at previous results.
My question is - what data do I save?
Do I save the SQL query with those parameters (i.e. the date parameters), so he can see the results relevant to that specific date? This means that when the user clicks on this specific result, I need to execute the query again.
Save the HTML that was generated back then, and simply display it when the user wishes to see this result?
I'd appreciate it if somebody would explain the pros and cons of each solution, and which one is considered the best & the most efficient.
The archive will probably be 1-2 months old, and I can't really predict the amount of rows each query will return.
Thanks!
Specifically regarding retrieving the results from queries that have been run previously, I would suggest saving the results so they can be viewed later rather than running the queries again and again. The main benefits of this approach are:
You save unnecessary computational work re-running the same queries;
You guarantee that the result set will be the same as the original report. For example if you save just the SQL then the records queried may have changed since the query was last run or records may have been added / deleted.
The disadvantage of this approach is that it will probably use more disk space, but this is unlikely to be an issue unless you have queries returning millions of rows (in which case HTML is probably not such a good idea anyway).
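As a sketch of what the "save the results" option could look like (the model name and columns are purely illustrative, not anything from the question), storing the generated HTML alongside the query that produced it gives you both the snapshot and the provenance:

from datetime import datetime
from sqlalchemy import Column, DateTime, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class ReportSnapshot(Base):
    __tablename__ = 'report_snapshot'
    id = Column(Integer, primary_key=True)
    run_at = Column(DateTime, default=datetime.utcnow)  # when the scheduled query ran
    query_text = Column(Text)                           # the SQL plus its parameters, for reference
    html = Column(Text)                                 # the generated HTML, stored verbatim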
If I were building this type of application, then:
I would provide some common queries (current date, current time, date ranges, time ranges, and others relevant to the application) for the user to select easily.
I would add some autocompletion for common keywords.
If the data changes frequently there is no point in saving the HTML; generating a new one is the better option.
The crucial difference is that if the data changes, a new query will return a different result than what was saved some time ago, so you have to decide whether the user should get the up-to-date data or a snapshot of what the data used to be.
If the relevant data does not change, it's a matter of how expensive the queries are, how many users will run them, and how often; you may then decide to save the results instead of re-running the queries, to improve performance.
I have one small webapp, which uses Python/Flask and a MySQL db for data storage. I have a students database, which has around 3 thousand rows. When trying to load that page, the loading takes a very long time, sometimes even a minute or so. It's usually around 20 seconds, which is really slow, and I am wondering what is causing this. This is the state of the server before any request is made, and this happens when I try to load that site.
As I said, this is not too many records, and I am puzzled by why this is so slow. I am using Ubuntu 12.04, with mysql Ver 14.14 Distrib 5.5.32, for debian-linux-gnu (x86_64) using readline 6.2. Other queries run fine; for example, listing students whose name starts with some letter takes around 2-3 seconds, which is acceptable. That shows only a portion of the table, so I am guessing something is not optimized right.
My my.cnf file is located here. I tried some stuff and added some lines at the bottom, but without too much success.
The actual queries are done by SQLAlchemy, and this is the specific code used to load this:
score = db.session.query(Scores.id).order_by(Scores.date.desc()).correlate(Students).filter(Students.email == Scores.email).limit(1)
students = db.session.query(Students, score.as_scalar()).filter_by(archive=0).order_by(Students.exam_date)
return render_template("students.html", students=students.all())
This appears to be the SQL generated:
SELECT student.id AS student_id, student.first_name AS student_first_name,
       student.middle_name AS student_middle_name, student.last_name AS student_last_name,
       student.email AS student_email, student.password AS student_password,
       student.address1 AS student_address1, student.address2 AS student_address2,
       student.city AS student_city, student.state AS student_state,
       student.zip AS student_zip, student.country AS student_country,
       student.phone AS student_phone, student.cell_phone AS student_cell_phone,
       student.active AS student_active, student.archive AS student_archive,
       student.imported AS student_imported, student.security_pin AS student_security_pin,
       (SELECT scores.id
        FROM scores
        WHERE student.email = scores.email
        ORDER BY scores.date DESC
        LIMIT 1) AS anon_1
FROM student
WHERE student.archive = 0
Thanks in advance for your time and help!
datasage is right - the micro instance can only do so much. You might try starting a second micro instance for your MySQL database. Running both Apache and MySQL on a single micro instance will be slow.
From my experience, when using AWS's RDS service (MySQL), you can get reasonable performance on the micro instance for testing. Depending on how long the instance has been up, you can sometimes get crawlers pinging your site, so it can help to IP-restrict it to your computer in the security policy.
It doesn't look like your database structure is that complex - you might add an index on your email fields, but I suspect unless your dataset is over 5000 rows it won't make much difference. If you're using the SQLAlchemy ORM, this would look like:
class Scores(base):
    __tablename__ = 'center_master'
    id = Column(Integer(), primary_key=True)
    email = Column(String(255), index=True)
Micro instances are pretty slow performance-wise. They are designed with burstable CPU profiles and will be heavily throttled once the burstable time is exceeded.
That said, your problem here is likely with your database design. Any time you want to join two tables, you want to have indexes on columns of the right and left side of the join. In this case you are using the email field.
Joining on strings isn't quite as optimal as joining on an integer id. Also, using the EXPLAIN keyword when running the query directly in MySQL will show you an execution plan and can help you quickly identify where you may be missing indexes or have other problems.
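For instance, here is a rough sketch (assuming the same db session and students query from the question) that compiles the ORM query to raw SQL and asks MySQL for the plan:

from sqlalchemy import text

# Compile the ORM query to a literal SQL string, then ask MySQL to EXPLAIN it.
# A 'type: ALL' row in the output means a full table scan, i.e. a likely missing index.
raw_sql = str(students.statement.compile(compile_kwargs={"literal_binds": True}))
for row in db.session.execute(text("EXPLAIN " + raw_sql)):
    print(row)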
How to do this on Google App Engine (Python):
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
Long version:
I have a Python Google App Engine application with users that generate events, such as pageviews. I would like to know in a given timespan how many unique users generated a pageview event. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.
My event entities look like this:
class Event(db.Model):
    t = db.DateTimeProperty(auto_now_add=True)
    user = db.StringProperty(required=True)
    event_type = db.StringProperty(required=True)
With an SQL database, I would do something like
SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW"
AND t >= start_time AND t <= end_time
First thought that occurs is to get all PAGEVIEW events and filter out duplicate users. Something like:
query = Event.all()
query.filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time)
query.filter("t <=", end_time)

usernames = []
for event in query:
    usernames.append(event.user)

answer = len(set(usernames))
But this won't work, because it will only support up to 1000 events. Next thing that occurs to me is to get 1000 events, then when those run out get the next thousand and so on. But that won't work either, because going through a thousand queries and retrieving a million entities would take over 30 seconds, which is the request time limit.
Then I thought I should ORDER BY user to skip over duplicates faster. But that is not allowed because I am already using the inequality "t >= start_time AND t <= end_time".
It seems clear this cannot be accomplished in under 30 seconds, so it needs to be fragmented. But finding distinct items doesn't seem to split well into subtasks. The best I can think of is, on every cron job call, to find 1000 pageview events, get the distinct usernames from those, and put them in an entity like Chard. It could look something like this:
class Chard(db.Model):
    usernames = db.StringListProperty(required=True)
So each Chard would have up to 1000 usernames in it, fewer if there were duplicates that got removed. After about 16 hours (which is fine) I would have all the Chards and could do something like:
chards = Chard.all()
all_usernames = set()
for chard in chards:
    all_usernames = all_usernames.union(chard.usernames)
answer = len(all_usernames)
It seems like it might work, but it's hardly a beautiful solution. And with enough unique users this loop might take too long. I haven't tested it in the hope that someone will come up with a better suggestion, so I don't know whether this loop would turn out to be fast enough.
Is there any prettier solution to my problem?
Of course all of this unique user counting could be accomplished easily with Google Analytics, but I am constructing a dashboard of application specific metrics, and intend this to be the first of many stats.
As of SDK v1.7.4, there is now experimental support for the DISTINCT function.
See : https://developers.google.com/appengine/docs/python/datastore/gqlreference
Here is a possibly-workable solution. It relies to an extent on using memcache, so there is always the possibility that your data would get evicted in an unpredictable fashion. Caveat emptor.
You would have a memcache variable called unique_visits_today or something similar. Every time that a user had their first pageview of the day, you would use the .incr() function to increment that counter.
Determining that this is the user's first visit of the day is accomplished by looking at a last_activity_day field attached to the user. When the user visits, you look at that field, and if it is not today, you update it to today and increment your memcache counter.
At midnight each day, a cron job would take the current value in the memcache counter and write it to the datastore while setting the counter to zero. You would have a model like this:
class UniqueVisitsRecord(db.Model):
    # be careful setting the date correctly if processing at midnight
    activity_date = db.DateProperty()
    event_count = db.IntegerProperty()
You could then simply, easily, and quickly get all of the UniqueVisitsRecord entities that match any date range and add up the numbers in their event_count fields.
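A rough sketch of the whole flow, assuming the last_activity_day field and the model above (the memcache key and helper names are made up for illustration):

from datetime import date
from google.appengine.api import memcache

COUNTER_KEY = 'unique_visits_today'  # illustrative memcache key

def record_pageview(user_record):
    # user_record is assumed to carry a last_activity_day db.DateProperty.
    today = date.today()
    if user_record.last_activity_day != today:
        user_record.last_activity_day = today
        user_record.put()
        # initial_value recreates the counter if memcache evicted it
        memcache.incr(COUNTER_KEY, initial_value=0)

def midnight_cron():
    # Run from cron at midnight: persist today's total, then reset the counter.
    count = memcache.get(COUNTER_KEY) or 0
    UniqueVisitsRecord(activity_date=date.today(), event_count=int(count)).put()
    memcache.set(COUNTER_KEY, 0)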
NDB still does not support DISTINCT. I have written a small utility method to be able to use distinct with GAE.
See here. http://verysimplescripts.blogspot.jp/2013/01/getting-distinct-properties-with-ndb.html
Google App Engine, and more particularly GQL, does not support a DISTINCT function.
But you can use Python's built-in set as described in this blog and in this SO question.
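In practice that is the same set-based approach the question already sketched; a compact version (subject to the same request-time and result-size limits discussed above) would be:

query = Event.all().filter("event_type =", "PAGEVIEW")
query.filter("t >=", start_time).filter("t <=", end_time)

# Collect users into a set so duplicates collapse automatically.
distinct_users = set(event.user for event in query)
print(len(distinct_users))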