Google App Engine: Increasingly long time to read data - python

I am using GAE to run a pet project.
I have a large table (100K rows) that I am running an indexed query against. That seems fine. However, iterating through the results seems to take non-linear time. Doing some profiling, it seems that for the first batch of rows (100 or so) it acts linearly, but then it falls off a cliff and starts taking more and more time for each row retrieved. Here is the code sketch:
q = Metrics.all()
q.filter('Tag =', 'All')
q.order('-created')
iterator = q.run(limit=100)
l = []
for i in iterator:
    l.append(i.created)
Any idea what could cause this to behave non-linearly?

Most likely it is because you are not making use of Query Cursors; use them instead and you'll see your performance improve.
Also, it looks like you are using the old db API; consider switching to NDB, since the newer implementation is supposed to be better and faster.
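A minimal sketch of paging with a query cursor on the old db API (assuming the Metrics model from the question):
q = Metrics.all().filter('Tag =', 'All').order('-created')
page = q.fetch(100)     # first page
cursor = q.cursor()     # remember where this page ended

q2 = Metrics.all().filter('Tag =', 'All').order('-created')
q2.with_cursor(cursor)  # resume exactly where the previous page stopped
next_page = q2.fetch(100)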

If you know the exact number of results you want to process, consider using fetch(). run() retrieves results in smaller chunks (the default batch size is 20), so there are extra round trips involved.
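A rough sketch of the fetch() variant, using the same query as the question:
q = Metrics.all().filter('Tag =', 'All').order('-created')
# fetch() pulls the requested entities in one batch, rather than run()'s default batches of 20.
created_times = [m.created for m in q.fetch(100)]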
Off topic: it might also be good to rename the list variable l to something more descriptive :)

Related

Which has optimal performance for generating a randomized list: `random.shuffle(ids)` or `.order_by("?")`?

I need to generate a randomized list of 50 items to send to the front-end for a landing page display. The landing page already loads much too slowly, so any optimization would be wonderful!
Given the pre-existing performance issues and the large size of this table, I'm wondering which implementation is better practice, or if the difference is negligible:
Option A:
unit_ids = list(units.values_list('id', flat=True).distinct())
random.shuffle(unit_ids)
unit_ids = unit_ids[:50]
Option B:
list(units.values_list('id', flat=True).order_by("?")[:50])
My concern is that according to the django docs, order_by('?') "may be expensive and slow"
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.order_by
We are using a MySQL db. I've tried searching for more info about implementation, but I'm not seeing anything more specific than what's in the docs. Help!
Option B should be faster in most cases, since the database engine is usually faster than code in Python.
In option A, you are retrieving what I guess is all of the ids, and then shuffling them in Python. Since you say the table is large, that makes it a bad idea to do the work in Python. Also, you are only getting the ids, which means that if you need the actual data you have to make another query.
With all that said, you should still try both and see which one is faster, because the result depends on several variables. Just time them both and go with whichever works faster for you.
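A quick sketch of how you might time the two options on your own data (units is the queryset from the question):
import random
import time

start = time.perf_counter()
ids = list(units.values_list('id', flat=True).distinct())
random.shuffle(ids)
option_a = ids[:50]
print('Option A:', time.perf_counter() - start, 'seconds')

start = time.perf_counter()
option_b = list(units.values_list('id', flat=True).order_by('?')[:50])
print('Option B:', time.perf_counter() - start, 'seconds')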
Tradeoffs:
Shoveling large amounts of data to the client (TEXT columns; all the rows; etc)
Whether the table is so big that fetching N random rows is likely to hit the disk N times.
My first choice would be simply:
SELECT * FROM t ORDER BY RAND() LIMIT 50;
My second choice would be to use "lazy loading" (not unlike your random.shuffle, but better because it does not need a second round-trip):
SELECT t.*
FROM ( SELECT id FROM t ORDER BY RAND() LIMIT 50 ) AS r
JOIN t USING(id)
If that is not "fast enough", then first find out whether the subquery is the slowdown or the outer query.
If the inner query is the problem, then see http://mysql.rjweb.org/doc.php/random
If the outer query is the problem, you are doomed. It is already optimal (assuming PRIMARY KEY(id)).

Will the for loop affect the speed in a pyspark dataframe?

I have this code, which splits the dataframe into chunks of 10000 rows and writes each chunk to a file.
I tried a z1d instance with 24 CPUs and 192GB of RAM, but even that did not speed things up much: 1 million rows took 9 minutes.
This is the code:
total = df2.count()
offset = 10000
counter = int(total / offset) + 1
idxDf = df.withColumn("idx", monotonically_increasing_id())
for i in range(0, counter):
    lower = i * offset
    upper = lower + offset
    filter = f"idx > {lower} and idx < {upper}"
    ddf = idxDf.filter(filter)
    ddf2 = ddf.drop("idx")
    ddf2.write.option("header", "false").option("delimiter", " ").option("compression", "gzip").csv(outputpath)
Is there any way I can make this faster? Currently I am using a single master node only. I have 100 million rows and want to know how fast I can do this with EMR.
It looks like my normal Python code is able to do the same thing in about the same number of minutes.
A few problems with what you’re trying to do here:
Stop trying to write pyspark code as if it’s normal python code. It isn’t. Read up on exactly how spark works first and foremost. You’ll have more success if you change the way you program when you use spark, not try to get spark to do what you want in the way you want.
Avoid for loops with Spark wherever possible. for loops only work within native Python, so you're not utilising Spark when you start one, which means only one CPU on one Spark node will run the code.
Python is, by default, single threaded. Adding more CPUs will do literally nothing for the performance of native Python code (i.e. your for loop) unless you rewrite it for either (a) multi-threaded processing or (b) distributed processing (i.e. Spark).
You only have one master node (and I assume zero slave nodes). That's going to take ages to process that much data. The point of Spark is to distribute the workload onto many slave nodes. There are some really technical ways to determine the optimal number of slave nodes for your problem, but try something like >50 or >100 slaves. That should give you a decent performance uplift (each node able to process roughly 1GB-4GB of data). Still too slow? Either add more slave nodes, or choose more powerful machines for the slaves. I remember that running a 100GB file through some heavy lifting took a whole day on 16 nodes; upping the machine spec and the number of slaves brought it down to an hour.
For writing files, don’t try and reinvent the wheel if you don’t need to.
Spark will automatically write your files in a distributed manner according to the level of partitioning on the dataframe. On disk, it should create a directory called outputpath which contains the n distributed files:
df = df.repartition(n_files)  # repartition returns a new DataFrame, so reassign it
df.write.option("header", "false").option("delimiter", " ").option("compression", "gzip").csv(outputpath)
You should get a directory structured something like this:
path/to/outputpath:
- part-737hdeu-74dhdhe-uru24.csv.gz
- part-24hejje-hrhehei-47dhe.csv.gz
- ...
Hope this helps. Also, partitioning is super important. If your initial file is not distributed (one big csv), it's a good idea to do df = df.repartition(x) on the dataframe after you load it, where x = the number of slave nodes.
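Putting those points together, here is a hedged sketch of how the chunked for loop from the question could become a single distributed write (names taken from the question; the rows-per-file mapping is approximate, since Spark splits by partition rather than by an exact row count):
offset = 10000
total = df2.count()
n_files = max(1, int(total / offset) + 1)  # aim for roughly `offset` rows per output part

out = df2.repartition(n_files)
(out.write
    .option("header", "false")
    .option("delimiter", " ")
    .option("compression", "gzip")
    .csv(outputpath))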

Dataframe writing to Postgresql poor performance

Working in PostgreSQL, I have a cartesian join producing ~4 million rows.
The join takes ~5sec and the write back to the DB takes ~1min 45sec.
The data will be required for use in python, specifically in a pandas dataframe, so I am experimenting with duplicating this same data in python. I should say here that all these tests are running on one machine, so nothing is going across a network.
Using psycopg2 and pandas, reading in the data and performing the join to get the 4 million rows (from an answer here: cartesian product in pandas) takes consistently under 3 secs, impressive.
Writing the data back to a table in the database however takes anything from 8 minutes (best method) to 36+minutes (plus some methods I rejected as I had to stop them after >1hr).
While I was not expecting to reproduce the "sql only" time, I would hope to be able to get closer than 8 minutes (I'd have thought 3-5 mins would not be unreasonable).
Slower methods include:
36min - sqlalchemy's table.insert (from 'test_sqlalchemy_core' here https://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)
13min - psycopg2.extras.execute_batch (https://stackoverflow.com/a/52124686/3979391)
13-15min (depends on chunksize) - pandas.dataframe.to_sql (again using sqlalchemy) (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html)
Best way (~8min) is using psycopg2's cursor.copy_from method (found here: https://github.com/blaze/odo/issues/614#issuecomment-428332541).
This involves dumping the data to a csv first (in memory via io.StringIO); that alone takes 2 mins.
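For reference, a rough sketch of that copy_from approach (the connection and table names are placeholders):
import io

buf = io.StringIO()
df.to_csv(buf, index=False, header=False)  # the 2-minute CSV dump mentioned above
buf.seek(0)

with conn.cursor() as cur:
    # copy_from expects tab-separated input by default, so pass the CSV separator explicitly
    cur.copy_from(buf, 'target_table', sep=',')
conn.commit()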
So, my questions:
Anyone have any potentially faster ways of writing millions of rows from a pandas dataframe to postgresql?
The docs for the cursor.copy_from method (http://initd.org/psycopg/docs/cursor.html) state that the source object needs to support the read() and readline() methods (hence the need for io.StringIO). Presumably, if the dataframe supported those methods, we could dispense with the write to csv. Is there some way to add these methods?
Thanks.
Giles
EDIT:
On Q2 - pandas can now use a custom callable for to_sql, and the example given here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method does pretty much what I suggest above (i.e. it copies csv data directly from STDIN using StringIO).
I found an ~40% increase in write speed using this method, which brings to_sql close to the "best" method mentioned above.
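For reference, the callable from the linked pandas docs looks roughly like this (a sketch, not a verbatim copy; the engine and table names in the usage line are placeholders):
import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # "method" callable for DataFrame.to_sql that uses PostgreSQL COPY FROM STDIN.
    dbapi_conn = conn.connection  # raw DBAPI connection behind the SQLAlchemy connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

# Usage:
# df.to_sql('target_table', engine, method=psql_insert_copy, index=False)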
Answering Q1 myself:
It seems the issue had more to do with PostgreSQL (or rather databases in general). Taking into account the points made in this article: https://use-the-index-luke.com/sql/dml/insert I found the following:
1) Removing all indexes from the destination table resulted in the query running in 9 seconds. Rebuilding the indexes (in postgresql) took a further 12 seconds, so still well under the other times.
2) With only a primary key in place, inserting rows ordered by the primary key columns reduced the time taken to about a third. This makes sense as there should be little or no shuffling of the index rows required. I also verified that this is the reason why my cartesian join in postgresql was faster in the first place (i.e. the rows were ordered by the index, purely by chance); placing the same rows in a temporary table (unordered) and inserting from that actually took a lot longer.
3) I tried similar experiments on our mysql systems and found the same increase in insert speed when removing indexes. With mysql however it seemed that rebuilding the indexes used up any time gained.
I hope this helps anyone else who comes across this question from a search.
I still wonder if it is possible to remove the write to csv step in python (Q2 above) as I believe I could then write something in python that would be faster than pure postgresql.
Thanks, Giles

How to force Django models to be released from memory

I want to use a management command to run a one-time analysis of the buildings in Massachusetts. I have reduced the offending code to an 8 line snippet that demonstrates the problem I encounter. The comments just explain why I want to do this at all. I am running the code below verbatim, in an otherwise-blank management command:
zips = ZipCode.objects.filter(state='MA').order_by('id')
for zip in zips.iterator():
    buildings = Building.objects.filter(boundary__within=zip.boundary)
    important_buildings = []
    for building in buildings.iterator():
        # Some conditionals would go here
        important_buildings.append(building)
    # Several types of analysis would be done on important_buildings, here
    important_buildings = None
When I run this exact code, I find that memory usage steadily increases with each iteration of the outer loop (I use print('mem', process.memory_info().rss) to check memory usage).
It seems like the important_buildings list is hogging up memory, even after going out of scope. If I replace important_buildings.append(building) with _ = building.pk, it no longer consumes much memory, but I do need that list for some of the analysis.
So, my question is: How can I force Python to release the list of Django models when it goes out of scope?
Edit: I feel like there's a bit of a catch 22 on stack overflow -- if I write too much detail, no one wants to take the time to read it (and it becomes a less applicable problem), but if I write too little detail, I risk overlooking part of the problem. Anyway, I really appreciate the answers, and plan to try some of the suggestions out this weekend when I finally get a chance to get back to this!!
Very quick answer: memory is being freed; rss is not a very accurate tool for telling where the memory is being consumed. rss gives a measure of the memory the process has used, not the memory the process is using (keep reading to see a demo). You can use the package memory-profiler to check, line by line, the memory use of your function.
So, how do you force Django models to be released from memory? You can't even tell you have such a problem just by using process.memory_info().rss.
I can, however, propose a solution for you to optimize your code, and write a demo of why process.memory_info().rss is not a very accurate tool to measure memory being used in some block of code.
Proposed solution: as demonstrated later in this same post, applying del to the list is not going to be the solution. Optimization using chunk_size for iterator will help (be aware the chunk_size option for iterator was added in Django 2.0), that's for sure, but the real enemy here is that nasty list.
That said, you can use a list of just the fields you need to perform your analysis (I'm assuming your analysis can't be tackled one building at a time) in order to reduce the amount of data stored in that list.
Try getting just the attributes you need on the go, and select the targeted buildings using Django's ORM:
for zip in zips.iterator():  # Using chunk_size here if you're working with Django >= 2.0 might help.
    important_buildings = Building.objects.filter(
        boundary__within=zip.boundary,
        # Some conditions here ...
        # You could even use annotations with conditional expressions
        # such as Case and When.
        # Also Q and F expressions.
        # It is very uncommon to have a use case you cannot address
        # with Django's ORM.
        # Ultimately you could use raw SQL. Anything to avoid having
        # a list with the whole object.
    )

    # And then just load into the list the data you need
    # to perform your analysis.

    # Analysis according to size.
    data = important_buildings.values_list('size', flat=True)

    # Analysis according to height.
    data = important_buildings.values_list('height', flat=True)

    # Perhaps you need more than one attribute ...
    # Analysis according to height and size.
    data = important_buildings.values_list('height', 'size')

    # Etc ...
It's very important to note that with a solution like this you will only hit the database when populating the data variable, and of course you will only hold in memory the minimum required for accomplishing your analysis.
Thinking in advance.
When you hit issues like this you should start thinking about parallelism, clustering, big data, etc. Also read about ElasticSearch; it has very good analysis capabilities.
Demo
process.memory_info().rss won't tell you about memory being freed.
I was really intrigued by your question and the fact you describe here:
It seems like the important_buildings list is hogging up memory, even after going out of scope.
Indeed, it seems that way, but it is not so. Look at the following example:
from psutil import Process

process = Process()  # the current process

def memory_test():
    a = []
    for i in range(10000):
        a.append(i)
    del a
    print(process.memory_info().rss)  # Prints 29728768

memory_test()
print(process.memory_info().rss)  # Prints 30023680
So even though the memory of a is freed, the last number is bigger. That's because memory_info().rss reports the total memory the process has used, not the memory it is using at the moment, as stated in the docs: memory_info.
The following image is a plot (memory/time) for the same code as before, but with range(10000000).
I used the mprof script that comes with memory-profiler to generate this graph.
You can see the memory is completely freed, which is not what you see when you profile using process.memory_info().rss.
If I replace important_buildings.append(building) with _ = building.pk, it uses less memory.
That will always be the case: a list of objects will always use more memory than a single object.
On the other hand, you can also see that the memory used doesn't grow linearly, as you might expect. Why?
From this excellent site we can read:
The append method is “amortized” O(1). In most cases, the memory required to append a new value has already been allocated, which is strictly O(1). Once the C array underlying the list has been exhausted, it must be expanded in order to accommodate further appends. This periodic expansion process is linear relative to the size of the new array, which seems to contradict our claim that appending is O(1).
However, the expansion rate is cleverly chosen to be three times the previous size of the array; when we spread the expansion cost over each additional append afforded by this extra space, the cost per append is O(1) on an amortized basis.
It is fast but has a memory cost.
The real problem is not that the Django models are not being released from memory. The problem is the algorithm/solution you've implemented: it uses too much memory. And of course, the list is the villain.
A golden rule for Django optimization: replace the use of lists with querysets wherever you can.
You don't provide much information about how big your models are, nor what links there are between them, so here are a few ideas:
By default QuerySet.iterator() will load 2000 elements in memory (assuming you're using django >= 2.0). If your Building model contains a lot of info, this could possibly hog up a lot of memory. You could try changing the chunk_size parameter to something lower.
Does your Building model have links between instances that could cause reference cycles that the gc can't find? You could use gc debug features to get more detail.
Or, short-circuiting the above idea, maybe just call del(important_buildings) and del(buildings) followed by gc.collect() at the end of every loop iteration to force garbage collection (a sketch follows below)?
The scope of your variables is the function, not just the for loop, so breaking up your code into smaller functions might help. Although note that the python garbage collector won't always return memory to the OS, so as explained in this answer you might need to get to more brutal measures to see the rss go down.
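A minimal sketch of that del + gc.collect() idea, reusing the names from the question:
import gc

for zip in zips.iterator():
    buildings = Building.objects.filter(boundary__within=zip.boundary)
    important_buildings = []
    for building in buildings.iterator():
        # Some conditionals would go here
        important_buildings.append(building)
    # ... analysis on important_buildings would go here ...
    del important_buildings, buildings
    gc.collect()  # force a collection at the end of each outer iteration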
Hope this helps!
EDIT:
To help you understand what code uses your memory and how much, you could use the tracemalloc module, for instance using the suggested code:
import linecache
import os
import tracemalloc

def display_top(snapshot, key_type='lineno', limit=10):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))

tracemalloc.start()

# ... run your code ...

snapshot = tracemalloc.take_snapshot()
display_top(snapshot)
Laurent S's answer is quite on the point (+1 and well done from me :D).
There are some points to consider in order to cut down on your memory usage:
The iterator usage:
You can set the chunk_size parameter of the iterator to something as small as you can get away with (ex. 500 items per chunk).
That will make your query slower (since smaller chunks mean more round trips to the database) but it will cut down on your memory consumption.
The only and defer options:
defer(): In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.
only(): Is more or less the opposite of defer(). You call it with the fields that should not be deferred when retrieving a model. If you have a model where almost all the fields need to be deferred, using only() to specify the complementary set of fields can result in simpler code.
Therefore you can cut down on what you are retrieving from your models in each iterator step and keep only the essential fields for your operation.
If your query still remains too memory heavy, you can choose to keep only the building_id in your important_buildings list and then use this list to make the queries you need from your Building's model, for each of your operations (this will slow down your operations, but it will cut down on the memory usage).
You may be able to improve your queries enough to solve parts (or even the whole) of your analysis, but with the state of your question at this moment I cannot tell for sure (see the PS at the end of this answer).
Now let's try to bring all the above points together in your sample code:
# You don't use more than the "boundary" field, so why bring more?
# You could even use "values_list('boundary', flat=True)",
# except if you are using more than that (I cannot tell from your sample).
zips = ZipCode.objects.filter(state='MA').order_by('id').only('boundary')

for zip in zips.iterator():
    # I would use a set() instead of a list to avoid duplicates.
    important_buildings = set()

    # Keep only the essential fields for your operations using "only" (or "defer").
    for building in Building.objects.filter(boundary__within=zip.boundary)\
                                    .only('essential_field_1', 'essential_field_2', ...)\
                                    .iterator(chunk_size=500):
        # Some conditionals would go here
        important_buildings.add(building)
If this still hogs too much memory for your liking you can use the 3rd point above like this:
zips = ZipCode.objects.filter(state='MA').order_by('id').only('boundary')

for zip in zips.iterator():
    important_buildings = set()
    for building in Building.objects.filter(boundary__within=zip.boundary)\
                                    .only('pk', 'essential_field_1', 'essential_field_2', ...)\
                                    .iterator(chunk_size=500):
        # Some conditionals would go here
        # Create a set containing only the important buildings' ids
        important_buildings.add(building.pk)
and then use that set to query your buildings for the rest of your operations:
# Converting set to list may not be needed but I don't remember for sure :)
Building.objects.filter(pk__in=list(important_buildings))...
PS: If you can update your question with more specifics, like the structure of your models and some of the analysis operations you are trying to run, we may be able to provide more concrete answers to help you!
Have you considered Union? Looking at the code you posted, you are running a lot of queries within that command, but you could offload that work to the database with Union.
combined_area = FooModel.objects.filter(...).aggregate(area=Union('geom'))['area']
final = BarModel.objects.filter(coordinates__within=combined_area)
Tweaking the above could essentially narrow down the queries needed for this function to one.
It's also worth looking at DjangoDebugToolbar, if you haven't looked at it already.
To release memory, you must copy the important details of each building in the inner loop into new objects to be used later, while discarding the buildings that are not suitable. In code not shown in the original post, references to the inner loop objects still exist, hence the memory issues. By copying the relevant fields to new objects, the originals can be deleted as intended.

temp store full when iterating over a large amount of sqlite3 records

I have a large SQLite db where I am joining a 3.5M-row table onto itself. I use SQLite since it is the serialization format of my python3 application and the flatfile format is important in my workflow. When iterating over the rows of this join (around 55M rows) using:
cursor.execute('SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
               'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
for row in cursor:
    # do stuff with row.
EXPLAIN QUERY PLAN gives the following:
0|0|0|SCAN TABLE proteins AS p USING INDEX pid_index (~1000000 rows)
0|1|1|SEARCH TABLE proteins AS pp USING INDEX pname_index (pname=?) (~10 rows)
0|0|0|USE TEMP B-TREE FOR DISTINCT
sqlite3 errors out with "database or disk is full" after, say, 1,000,000 rows, which seems to indicate a full SQLite on-disk temp store. Since I have enough RAM on my current box, that can be solved by setting the temp store to in-memory, but that is suboptimal since in that case all the RAM seems to be used up, and I tend to run 4 or so of these processes in parallel. My (probably incorrect) assumption was that the iterator was a generator and would not put a large load on memory, unlike e.g. fetchall, which would load all rows. However, I now run out of disk space (on a small SSD scratch disk), and I assume SQLite needs to store the results somewhere.
A way around this may be to run chunks of SELECT ... LIMIT x OFFSET y queries, but they get slower each time a bigger OFFSET is used. Is there any other way to run this? What is stored in these temporary files? They seem to grow the further I iterate.
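(For reference, the in-memory temp store mentioned above is a one-line PRAGMA; the database filename below is a placeholder.)
import sqlite3

conn = sqlite3.connect('proteins.db')       # placeholder filename
conn.execute('PRAGMA temp_store = MEMORY')  # keep temporary B-trees in RAM instead of on disk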
0|0|0|USE TEMP B-TREE FOR DISTINCT
Here's what's using the disk.
In order to support DISTINCT, SQLite has to store what rows already appeared in the query. For a large number of results, this set can grow huge. So to save on RAM, SQLite will temporarily store the distinct set on disk.
Removing the DISTINCT clause is an easy way to avoid the issue, but it changes the meaning of the query; you can now get duplicate rows. If you don't mind that, or you have unique indices or some other way of ensuring that you never get duplicates, then that won't matter.
What you are trying to do with SQLite3 is a very bad idea, let me try to explain why.
You have the raw data on disk where it fits and is readable.
You generate a result inside of SQLite3 which expands greatly.
You then try to transfer this very large dataset through an sql connector.
Relational databases in general are not made for this kind of operation, and SQLite3 is no exception. Relational databases were made for small, quick queries that live for a fraction of a second and return a couple of rows.
You would be better off using another tool.
Reading the whole dataset into Python using pandas, for instance, is my recommended solution. Using itertools is also a good idea.
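A hedged sketch of that pandas route, reading the join in chunks rather than materializing all ~55M rows at once (the database filename is a placeholder, and the DISTINCT cost on the SQLite side is unchanged):
import sqlite3
import pandas as pd

conn = sqlite3.connect('proteins.db')  # placeholder filename
query = ('SELECT DISTINCT p.pid, pp.pname, pp.pid FROM proteins '
         'AS p JOIN proteins AS pp USING(pname) ORDER BY p.pid')
# chunksize makes read_sql_query yield DataFrames of that many rows at a time.
for chunk in pd.read_sql_query(query, conn, chunksize=100000):
    pass  # do stuff with each chunk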
