I want to use a management command to run a one-time analysis of the buildings in Massachusetts. I have reduced the offending code to a short snippet that demonstrates the problem I encounter. The comments just explain why I want to do this at all. I am running the code below verbatim, in an otherwise-blank management command:
zips = ZipCode.objects.filter(state='MA').order_by('id')
for zip in zips.iterator():
    buildings = Building.objects.filter(boundary__within=zip.boundary)
    important_buildings = []
    for building in buildings.iterator():
        # Some conditionals would go here
        important_buildings.append(building)
    # Several types of analysis would be done on important_buildings, here
    important_buildings = None
When I run this exact code, I find that memory usage steadily increases with each iteration of the outer loop (I use print('mem', process.memory_info().rss) to check memory usage).
It seems like the important_buildings list is hogging up memory, even after going out of scope. If I replace important_buildings.append(building) with _ = building.pk, it no longer consumes much memory, but I do need that list for some of the analysis.
So, my question is: How can I force Python to release the list of Django models when it goes out of scope?
Edit: I feel like there's a bit of a catch-22 on Stack Overflow: if I write too much detail, no one wants to take the time to read it (and it becomes a less applicable problem), but if I write too little detail, I risk overlooking part of the problem. Anyway, I really appreciate the answers, and plan to try some of the suggestions out this weekend when I finally get a chance to get back to this!
Very quick answer: memory is being freed. The rss value is not a very accurate tool for telling where memory is being consumed: it reports the physical memory the process is holding from the operating system, which can include memory Python has already freed internally but not handed back (keep reading to see a demo). You can use the memory-profiler package to check the memory use of your function line by line.
So, how do you force Django models to be released from memory? You can't tell whether you have such a problem just by using process.memory_info().rss.
I can, however, propose a way to optimize your code, and write a demo of why process.memory_info().rss is not a very accurate tool for measuring the memory used by a block of code.
Proposed solution: as demonstrated later in this same post, applying del to the list is not going to be the solution. Using chunk_size with iterator will help, that's for sure (be aware that the chunk_size option for iterator was added in Django 2.0), but the real enemy here is that nasty list.
That said, you can store only the fields you need for your analysis (I'm assuming your analysis can't be tackled one building at a time) in order to reduce the amount of data held in that list.
Try fetching just the attributes you need on the go, and select the targeted buildings using Django's ORM.
for zip in zips.iterator():  # Using chunk_size here if you're working with Django >= 2.0 might help.
    important_buildings = Building.objects.filter(
        boundary__within=zip.boundary,
        # Some conditions here ...
        # You could even use annotations with conditional expressions
        # such as Case and When.
        # Also Q and F expressions.
        # It is very rare to find a use case you cannot address
        # with Django's ORM.
        # Ultimately you could use raw SQL. Anything to avoid having
        # a list with the whole object.
    )
    # And then just load into the list the data you need
    # to perform your analysis.
    # Analysis according to size.
    data = important_buildings.values_list('size', flat=True)
    # Analysis according to height.
    data = important_buildings.values_list('height', flat=True)
    # Perhaps you need more than one attribute ...
    # Analysis according to height and size.
    data = important_buildings.values_list('height', 'size')
    # Etc ...
It's very important to note that with a solution like this you only hit the database when populating the data variable. And, of course, you only hold in memory the minimum required for your analysis.
Thinking ahead.
When you hit issues like this you should start thinking about parallelism, clustering, big data, etc. Also read about Elasticsearch; it has very good analysis capabilities.
Demo
process.memory_info().rss Won't tell you about memory being freed.
I was really intrigued by your question and the fact you describe here:
It seems like the important_buildings list is hogging up memory, even after going out of scope.
It seems so, but it isn't. Look at the following example:
from psutil import Process
process = Process()  # handle for the current process
def memory_test():
    a = []
    for i in range(10000):
        a.append(i)
    del a
print(process.memory_info().rss)  # Prints 29728768
memory_test()
print(process.memory_info().rss)  # Prints 30023680
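So even though a has been deleted, the last number is bigger. That's because memory_info().rss is the non-swapped physical memory the process has used, as stated in the psutil docs for memory_info. It is not the memory your objects occupy at the moment, and memory that Python frees is not necessarily handed back to the operating system right away.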
The following image is a plot (memory/time) for the same code as before but with range(10000000)
I used the mprof script that comes with memory-profiler to generate this graph.
You can see the memory is completely freed, which is not what you see when you profile using process.memory_info().rss.
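If you want to check this without plotting anything, here is a minimal sketch using the standard library's tracemalloc, which tracks the memory Python itself has allocated rather than what the OS reports. The function name build_and_drop and the list size are just placeholders for illustration.

```python
import tracemalloc

def build_and_drop(n):
    """Build a large list, then let it go out of scope on return."""
    data = [str(i) for i in range(n)]
    return len(data)

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
build_and_drop(100_000)
# "after" is what Python still holds, "peak" is the high-water mark
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"before={before} after={after} peak={peak}")
```

The peak should sit far above the final value, which goes back close to where it started: the list really is released once the function returns, whatever rss says.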
If I replace important_buildings.append(building) with _ = building, it uses less memory
It will always be that way: a list of objects always uses more memory than a single object.
On the other hand, you can also see that the memory used doesn't grow linearly as you would expect. Why?
From this excellent site we can read:
The append method is “amortized” O(1). In most cases, the memory required to append a new value has already been allocated, which is strictly O(1). Once the C array underlying the list has been exhausted, it must be expanded in order to accommodate further appends. This periodic expansion process is linear relative to the size of the new array, which seems to contradict our claim that appending is O(1).
However, the expansion rate is cleverly chosen to be three times the previous size of the array; when we spread the expansion cost over each additional append afforded by this extra space, the cost per append is O(1) on an amortized basis.
It is fast but has a memory cost.
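You can watch this over-allocation happen yourself. The sketch below records every point at which the list's reported size changes while appending. Note that the exact growth factor is an implementation detail that differs between Python implementations and versions, so don't read too much into the specific numbers you get. The helper name growth_points is mine.

```python
import sys

def growth_points(n):
    """Return (length, size_in_bytes) each time the list's allocation grows."""
    items = []
    last_size = sys.getsizeof(items)
    points = []
    for i in range(n):
        items.append(i)
        size = sys.getsizeof(items)
        if size != last_size:
            # The underlying array was just expanded
            points.append((len(items), size))
            last_size = size
    return points

points = growth_points(1000)
for length, size in points[:8]:
    print(f"len={length:4d} size={size} bytes")
```

Only a small fraction of the 1000 appends trigger a resize. The rest reuse space that was allocated ahead of time, and that spare space is the memory cost mentioned above.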
The real problem is not that the Django models aren't being released from memory. The problem is the algorithm/solution you've implemented: it uses too much memory. And, of course, the list is the villain.
A golden rule for Django optimization: replace lists with querysets wherever you can.
You don't provide much information about how big your models are, nor what links there are between them, so here are a few ideas:
By default QuerySet.iterator() will load 2000 elements in memory (assuming you're using django >= 2.0). If your Building model contains a lot of info, this could possibly hog up a lot of memory. You could try changing the chunk_size parameter to something lower.
Does your Building model have links between instances that could cause reference cycles that the gc can't find? You could use gc debug features to get more detail.
Or, short-circuiting the above idea, maybe just call del important_buildings and del buildings followed by gc.collect() at the end of every loop to force garbage collection?
The scope of your variables is the function, not just the for loop, so breaking up your code into smaller functions might help. Note, though, that the Python garbage collector won't always return memory to the OS, so as explained in this answer you might need to resort to more brutal measures to see the rss go down.
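To make the reference-cycle point concrete, here is a small sketch. Node is a hypothetical stand-in for a model instance that points at another instance. Automatic collection is switched off only to make the demo deterministic.

```python
import gc
import weakref

class Node:
    """Hypothetical stand-in for an object that links to another instance."""
    def __init__(self):
        self.partner = None

def make_cycle():
    a, b = Node(), Node()
    a.partner, b.partner = b, a  # a and b now keep each other alive
    return weakref.ref(a)        # a weak reference does not keep "a" alive

gc.disable()                     # deterministic demo: no automatic collection
ref = make_cycle()
alive_before = ref() is not None  # locals are gone, but the cycle survives
gc.collect()                      # the cycle detector finds and frees it
alive_after = ref() is not None
gc.enable()

print(alive_before, alive_after)
```

Objects that are not part of a cycle are freed as soon as the last reference disappears, which is why moving the loop body into its own function can help by itself.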
Hope this helps!
EDIT:
To help you understand what code uses your memory and how much, you could use the tracemalloc module, for instance using the suggested code:
import linecache
import os
import tracemalloc
def display_top(snapshot, key_type='lineno', limit=10):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)
    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)
    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))
tracemalloc.start()
# ... run your code ...
snapshot = tracemalloc.take_snapshot()
display_top(snapshot)
Laurent S's answer is quite on the point (+1 and well done from me :D).
There are some points to consider in order to cut down on your memory usage:
The iterator usage:
You can set the chunk_size parameter of the iterator to something as small as you can get away with (e.g. 500 items per chunk).
That may make your query slower (since the results are fetched from the database in more, smaller batches), but it will cut down on your memory consumption.
The only and defer options:
defer(): In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.
only(): Is more or less the opposite of defer(). You call it with the fields that should not be deferred when retrieving a model. If you have a model where almost all the fields need to be deferred, using only() to specify the complementary set of fields can result in simpler code.
Therefore you can cut down on what you are retrieving from your models in each iterator step and keep only the essential fields for your operation.
If your query still remains too memory heavy, you can choose to keep only the building_id in your important_buildings list and then use this list to make the queries you need from your Building model for each of your operations (this will slow down your operations, but it will cut down on the memory usage).
You may be able to improve your queries enough to solve part (or even all) of your analysis, but with the state of your question at this moment I cannot tell for sure (see the PS at the end of this answer).
Now let's try to bring all the above points together in your sample code:
# You don't use more than the "boundary" field, so why bring more?
# You can even use "values_list('boundary', flat=True)"
# except if you are using more than that (I cannot tell from your sample)
zips = ZipCode.objects.filter(state='MA').order_by('id').only('boundary')
for zip in zips.iterator():
    # I would use "set()" instead of list to avoid duplicates
    important_buildings = set()
    # Keep only the essential fields for your operations using "only" (or "defer")
    for building in Building.objects.filter(boundary__within=zip.boundary)\
                    .only('essential_field_1', 'essential_field_2', ...)\
                    .iterator(chunk_size=500):
        # Some conditionals would go here
        important_buildings.add(building)
If this still hogs too much memory for your liking you can use the 3rd point above like this:
zips = ZipCode.objects.filter(state='MA').order_by('id').only('boundary')
for zip in zips.iterator():
    important_buildings = set()
    for building in Building.objects.filter(boundary__within=zip.boundary)\
                    .only('pk', 'essential_field_1', 'essential_field_2', ...)\
                    .iterator(chunk_size=500):
        # Some conditionals would go here
        # Create a set containing only the important buildings' ids
        important_buildings.add(building.pk)
and then use that set to query your buildings for the rest of your operations:
# Converting set to list may not be needed but I don't remember for sure :)
Building.objects.filter(pk__in=list(important_buildings))...
PS: If you can update your question with more specifics, like the structure of your models and some of the analysis operations you are trying to run, we may be able to provide more concrete answers to help you!
Have you considered Union? By looking at the code you posted you are running a lot of queries within that command but you could offload that to the database with Union.
combined_area = FooModel.objects.filter(...).aggregate(area=Union('geom'))['area']
final = BarModel.objects.filter(coordinates__within=combined_area)
Tweaking the above could essentially narrow down the queries needed for this function to one.
It's also worth looking at Django Debug Toolbar, if you haven't looked at it already.
To release the memory, copy the important details of each of the buildings in the inner loop into a new object to be used later, and drop the buildings that are not suitable. Code not shown in the original post holds references to objects from the inner loop, hence the memory issues. By copying the relevant fields to new objects, the originals can be released as intended.
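A rough sketch of that copying idea, runnable without Django: FakeBuilding is a stand-in for the real model, and the field names (pk, height, size) and the min_height condition are assumptions made up for illustration.

```python
from collections import namedtuple

# Lightweight record holding only what the analysis needs.
# The field names are hypothetical.
BuildingSummary = namedtuple("BuildingSummary", ["pk", "height", "size"])

class FakeBuilding:
    """Stand-in for a Django model instance, so the sketch runs without Django."""
    def __init__(self, pk, height, size):
        self.pk = pk
        self.height = height
        self.size = size
        self.payload = bytearray(10_000)  # simulates heavy fields we do not need

def summarize(buildings, min_height):
    """Copy the needed fields of qualifying buildings; keep no model references."""
    return [
        BuildingSummary(b.pk, b.height, b.size)
        for b in buildings
        if b.height >= min_height
    ]

# A generator plays the role of queryset.iterator(): one object at a time
buildings = (FakeBuilding(pk, height=pk * 10, size=pk * 100) for pk in range(1, 6))
important = summarize(buildings, min_height=30)
print(important)
```

Each full object becomes garbage as soon as the loop moves on, and only the small tuples stay in the list.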
Related
I need to put the texts contained in a column of a MySQL database (about 3 million rows) into a list of lists of tokens. These texts (which are tweets, therefore they are generally short) must be preprocessed before being included in the list (stop words, hashtags, tags etc. must be removed). This list should be passed later as a Word2Vec parameter. This is the part of the code involved
import mysql.connector
import re
from gensim.models import Word2Vec
import preprocessor as p
p.set_options(
    p.OPT.URL,
    p.OPT.MENTION,
    p.OPT.HASHTAG,
    p.OPT.NUMBER
)
conn = mysql.connector.connect(...)
cursor = conn.cursor()
query = "SELECT text FROM tweet"
cursor.execute(query)
table = cursor.fetchall()
stopwords = open('stopwords.txt', encoding='utf-8').read().split('\n')
sentences = []
for row in table:
    sentences = sentences + [[w for w in re.sub(r'[^\w\s-]', ' ', p.clean(row[0])).lower().split() if w not in stopwords and len(w) > 2]]
cursor.close()
conn.close()
model = Word2Vec(sentences)
...
Obviously it takes a lot of time and I know that my method is probably inefficient. Can anyone recommend a better one? I know it is not a question directly related to gensim and Word2Vec but perhaps those who use them have already faced the problem of working with a large amount of texts.
You haven't mentioned how long your code takes to run, but some potential sources of slowdown in your current technique might include:
the overhead of regex-based preprocessing, especially if a large number of independent regexes are each applied, separately, to the same texts
the inefficiency of expanding a Python list by appending one new item at a time - which as the list grows larger can sometimes be a factor
virtual-memory swapping, if the size of your data exceeds physical RAM
You can check the swapping issue by monitoring memory use using a platform-specific tool (like top on Linux systems) to view memory usage during the operation. If that's a contributor, using a machine with more RAM, or making other code changes to reduce RAM usage (see below), will help.
Your full preprocessing code isn't shown, but a common approach is a lot of independent steps, each of which involves one or more regular expressions, but then returns a plain modified string (for future steps).
As appealingly simple & pluggable as that is, it often becomes a source of avoidable slowness in preprocessing large amounts of text. For example, each regex/step itself might have to repeat detecting token-boundaries, or splitting then re-concatenating a string. Or, the regexes might use complex match patterns, or techniques (like backtracking) that can be expensive on worst-case inputs.
Often this sort of preprocessing can be greatly improved by one or more of:
coalescing multiple regexes into a single step, so a string faces one front-to-back pass, rather than N
breaking into short tokens early, then leaving the text as a list-of-tokens for later steps - thus never redundantly splitting/joining, and letting later token-oriented steps work on smaller strings and perhaps even simpler (non-regex) string-tests
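As a rough illustration of those two ideas, the sketch below folds URL, mention, hashtag and number removal into one compiled pattern and then tokenizes immediately. The patterns are simplified assumptions of mine, not a drop-in replacement for what the preprocessor package does.

```python
import re

# One compiled pattern covering URLs, @mentions, #hashtags and bare numbers.
# These patterns are deliberately simple and only illustrative.
NOISE = re.compile(r"https?://\S+|[@#]\w+|\b\d+\b")
TOKEN = re.compile(r"[\w-]+")

def preprocess(text):
    """Strip noise in one pass, then return a list of lowercase tokens."""
    return TOKEN.findall(NOISE.sub(" ", text).lower())

tokens = preprocess("Check https://example.com @bob #fun 42 cool-stuff NOW")
print(tokens)
```

The text is scanned twice in total (once to remove noise, once to tokenize), however many kinds of noise the first pattern covers. Later steps such as stop-word filtering can then work on the token list directly.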
Also, even if the preprocessing is still a bit time-consuming, a big process improvement is usually to be sure to only repeat it when the data changes. That is, if you're going to try a bunch of different downstream steps, like different Word2Vec parameters, make sure you're not doing the expensive preprocessing every time. Do it once, write the results aside to a file, then reuse the results file until it needs to be regenerated (because the data or preprocessing rules have changed).
Finally, if the append-one-more pattern is contributing to your slowness, you could pre-allocate your sentences (sentences = [None] * desired_length), then replace each row in your loop rather than append (sentences[row_num] = preprocessed_text). But that might not be a major factor, and in fact the suggestion above, about "reuse the results file", is a better way to minimize list-ops/RAM-usage, as well as enable reuse across alternate runs.
That is, open a new working file before your loop. Append each preprocessed text – with spaces between the tokens, and a newline at the end – as one new line to this file. Then, have your Word2Vec step work directly from that file. (In Gensim, you can do this by wrapping the file with a LineSentence utility object, which reads a file of that format back as a re-iterable sequence, with each item being a list-of-tokens, or by using the corpus_file parameter to feed the filename directly to Word2Vec.)
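A minimal sketch of that interim-file approach using only the standard library. The TokenLines class is my own stand-in that mimics the "one sentence per line, re-iterable" behaviour described above so the example runs without Gensim. The stop-word set and the sample rows are made up.

```python
import os
import re
import tempfile

STOPWORDS = {"the", "and"}  # hypothetical; the question loads these from a file

def tokenize(text):
    cleaned = re.sub(r"[^\w\s-]", " ", text).lower()
    return [w for w in cleaned.split() if w not in STOPWORDS and len(w) > 2]

class TokenLines:
    """Re-iterable view of a file holding one space-separated sentence per line."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

rows = [("The quick brown fox!",), ("Cats and dogs, again",)]  # stand-in for DB rows
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")

# Write each preprocessed text as one line; nothing accumulates in memory
with open(path, "w", encoding="utf-8") as out:
    for row in rows:
        out.write(" ".join(tokenize(row[0])) + "\n")

corpus = TokenLines(path)
print(list(corpus))
```

With a real database you would also iterate over the cursor instead of calling fetchall(), so the rows never all sit in memory at once. Keeping the stop words in a set rather than a list makes each membership test constant time.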
From that list of possible tactics, I'd try:
First, time your existing code for preprocessing (creating your sentences list).
Then, eliminate all fancy preprocessing, doing nothing more complicated than .split(), and re-time. If there's a big change, then yes, the preprocessing is the major slowdown, and concentrate on improving that.
If even that minimal preprocessing still seems slower-than-desired, then maybe the RAM/concatenation issues are a concern, and try writing to an interim file.
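For the timing step, something like the sketch below is usually enough. It also compares the sentences = sentences + [...] pattern from the question, which builds a brand-new list on every iteration, with a plain append. The sample rows are made up and the timings will vary from machine to machine.

```python
import time

def build_by_concatenation(rows):
    sentences = []
    for row in rows:
        sentences = sentences + [row.split()]  # copies the whole list each time
    return sentences

def build_by_append(rows):
    sentences = []
    for row in rows:
        sentences.append(row.split())  # amortized constant time per item
    return sentences

def timed(func, rows):
    """Run func(rows) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(rows)
    return result, time.perf_counter() - start

rows = ["just some short tweet text"] * 5000
concat_result, concat_seconds = timed(build_by_concatenation, rows)
append_result, append_seconds = timed(build_by_append, rows)
print(f"concatenation: {concat_seconds:.4f}s, append: {append_seconds:.4f}s")
```

Both functions produce the same list. The gap between them widens as the number of rows grows, so with 3 million rows the concatenation alone may be a noticeable part of the run time.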
Separately: it's not strictly necessary to worry about removing stop-words in word2vec training - much published work doesn't bother with that step, and the algorithm already includes a sample parameter which causes it to skip a lot of the very-overrepresented words during training as less-interesting. Similarly, 2- and even 1- character tokens may still be interesting, especially in the domain of tweets, so you might not want to always discard them. (For example, lone emoji can be significant 'words'.)
So, I have written an autocomplete and autocorrect program in Python 2. I wrote the autocorrect part using the approach described in Peter Norvig's blog post on how to write a spell checker.
Now, I am using a trie data structure implemented using nested lists. I am using a trie because it can give me all words starting with a particular prefix. At each leaf is a tuple with the word and a value denoting the frequency of the word. For example, the words bad, bat and cat would be saved as:
['b'['a'['d',('bad',4),'t',('bat',3)]],'c'['a'['t',('cat',4)]]]
Where 4, 3, 4 are the number of times the words have been used, or the frequency value. Similarly, I have made a trie of about 130,000 words of the English dictionary and stored it using cPickle.
Now, it takes about 3-4 seconds for the entire trie to be read each time. The problem is that each time a word is encountered the frequency value has to be incremented and then the updated trie needs to be saved again. As you can imagine, it would be a big problem waiting 3-4 seconds to read and then that much time again to save the updated trie each time. I will need to perform a lot of update operations each time the program is run and save them.
Is there a faster or efficient way to store a large data structure which repeatedly will be updated? How are the data structures of the autocorrect programs in IDEs and mobile devices saved & retrieved so fast? I am open to different approaches as well.
A few things come to mind.
1) Split the data. Say, use 26 files, each storing the tries starting with a certain character. You can refine this by splitting on a longer prefix. This way the amount of data you need to write is smaller.
2) Don't reflect everything to disk. If you need to perform a lot of operations, do them in RAM and write the result to disk at the end. If you're afraid of data loss, you can checkpoint your computation after some time X or after a number of operations.
3) Multi-threading. Unless your program only does spellchecking, it's likely there are other things it needs to do. Have a separate thread that does the loading and writing so that it doesn't block everything while it does disk IO. Multi-threading in Python is a bit tricky but it can be done.
4) Custom structure. Part of the time spent in serialization is invoking serialization functions. Since you have a dictionary for everything, that's a lot of function calls. In the perfect case you should have a memory representation that matches exactly the disk representation. You would then simply read a large string and put it into your custom class (and write that string to disk when you need to). This is a bit more advanced and the benefits will likely not be that huge, especially since Python is not so efficient at playing with bits, but if you need to squeeze the last bit of speed out of it, this is the way to go.
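Point 2 can be sketched roughly as follows. This is written for Python 3 (the question uses Python 2 and cPickle), uses a flat word -> count dict instead of the nested-list trie to stay short, and the FrequencyStore class with its checkpoint_every parameter is a hypothetical name of mine.

```python
import os
import pickle
import tempfile

class FrequencyStore:
    """Word counts held in memory and written to disk every N updates."""

    def __init__(self, path, checkpoint_every=1000):
        self.path = path
        self.checkpoint_every = checkpoint_every
        self.pending = 0
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.counts = pickle.load(f)
        else:
            self.counts = {}

    def increment(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1
        self.pending += 1
        if self.pending >= self.checkpoint_every:
            self.save()

    def save(self):
        # Write to a temporary file first, then swap it in, so a crash
        # mid-write does not clobber the previous checkpoint.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(self.counts, f, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, self.path)
        self.pending = 0

path = os.path.join(tempfile.mkdtemp(), "freq.pkl")
store = FrequencyStore(path, checkpoint_every=3)
for word in ["bad", "bat", "bad"]:
    store.increment(word)

reloaded = FrequencyStore(path)
print(reloaded.counts)
```

The data is read once at startup and written once per checkpoint rather than once per update. You would also call save() on a clean shutdown.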
I would suggest moving serialization to a separate thread and running it periodically. You don't need to re-read your data each time because you already have the latest version in memory. This way your program stays responsive to the user while the data is being saved to disk. The saved version on disk may lag behind, and the latest updates may get lost if the program crashes, but this shouldn't be a big issue for your use case, I think.
It depends on a particular use case and environment but, I think, most programs having local data sets sync them using multi-threading.
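A bare-bones sketch of saving from a separate thread, again in Python 3 and with a flat dict standing in for the trie. The function names are mine. A nested structure would need a deep copy or a lock instead of the shallow copy used here.

```python
import os
import pickle
import tempfile
import threading

def save_snapshot(snapshot, path):
    with open(path, "wb") as f:
        pickle.dump(snapshot, f, protocol=pickle.HIGHEST_PROTOCOL)

def save_in_background(counts, path):
    """Copy the data in the caller's thread, then write the copy in a worker."""
    snapshot = dict(counts)  # shallow copy; enough for a flat word -> count dict
    worker = threading.Thread(target=save_snapshot, args=(snapshot, path))
    worker.start()
    return worker

counts = {"bad": 4, "bat": 3, "cat": 4}
path = os.path.join(tempfile.mkdtemp(), "freq.pkl")

worker = save_in_background(counts, path)
counts["bad"] += 1  # the program keeps updating while the worker writes
worker.join()       # a real program would only wait like this at shutdown

with open(path, "rb") as f:
    saved = pickle.load(f)
print(saved, counts)
```

The file on disk holds the state as of the snapshot, so it lags behind the in-memory data, which is the trade-off described above.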
I have a bunch of flat files that basically store millions of paths and their corresponding info (name, atime, size, owner, etc)
I would like to compile a full list of all the paths stored collectively on the files. For duplicate paths only the largest path needs to be kept.
There are roughly 500 files and approximately a million paths in each text file. The files are also gzipped. So far I've been able to do this in Python, but the solution is not optimized: for each file it basically takes an hour to load and compare against the current list.
Should I go for a database solution? sqlite3? Is there a data structure or better algorithm to go about this in python? Thanks for any help!
So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
If "the current list" implies that you're keeping track of all of the paths seen so far in a list, and then doing if newpath in listopaths: for each line, then each one of those searches takes linear time. If you have 500M total paths, of which 100M are unique, you're doing O(500M*100M) comparisons.
Just changing that list to a set, and changing nothing else in your code (well, you need to replace .append with .add, and you can probably remove the in check entirely… but without seeing your code it's hard to be specific) makes each one of those checks take constant time. So you're doing O(500M) comparisons—100M times faster.
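Since you want to keep only the largest entry per path, a dict keyed by path does the same job as the set while remembering the size. Here is a minimal sketch, assuming each line looks like "path<TAB>size" and that "largest" means largest size. Both are guesses about your data, so adapt the parsing.

```python
def keep_largest(lines):
    """Map each path to the largest size seen for it.

    Each line is assumed to look like "path<TAB>size"; adapt the parsing
    to the real file layout.
    """
    largest = {}
    for line in lines:
        path, size_text = line.rstrip("\n").split("\t")
        size = int(size_text)
        if size > largest.get(path, -1):  # dict lookup is constant time on average
            largest[path] = size
    return largest

lines = ["/a/x\t10\n", "/a/y\t5\n", "/a/x\t30\n", "/a/y\t2\n"]
largest = keep_largest(lines)
print(largest)
```

Because keep_largest accepts any iterable of lines, you can pass it an open file object directly instead of a list.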
Another potential problem is that you may not have enough memory. On a 64-bit machine, you've got enough virtual memory to hold almost anything you want… but if there's not enough physical memory available to back that up, eventually you'll spend more time swapping data back and forth to disk than doing actual work, and your program will slow to a crawl.
There are actually two potential sub-problems here.
First, you might be reading each entire file in at once (or, worse, all of the files at once) when you don't need to (e.g., by decompressing the whole file instead of using gzip.open, or by using f = gzip.open(…) but then doing f.readlines() or f.read(), or whatever). If so… don't do that. Just iterate over the lines in each GzipFile, for line in f:.
Second, maybe even a simple set of however many unique lines you have is too much to fit in memory on your computer. In that case, you probably want to look at a database. But you don't need anything as complicated as sqlite. A dbm acts like a dict (except that its keys and values have to be byte strings), but it's stored on disk, caching things in memory where appropriate, instead of stored in memory, paging to disk randomly, which means it will go a lot faster in this case. (And it'll be persistent, too.) Of course you want something that acts like a set, not a dict… but that's easy. You can model a set as a dict whose values are always ''. So instead of paths.add(newpath), it's just paths[newpath] = ''. Yeah, that wastes a few bytes of disk space over building your own custom on-disk key-only hash table, but it's unlikely to make any significant difference.
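Putting both sub-points together, a small self-contained sketch: it writes a tiny gzipped sample file (made up, one path per line) so that it can run anywhere, streams it line by line, and uses a dbm file as an on-disk set.

```python
import dbm
import gzip
import os
import tempfile

workdir = tempfile.mkdtemp()

# Build a small gzipped sample file so the sketch is self-contained.
sample = os.path.join(workdir, "paths.txt.gz")
with gzip.open(sample, "wt", encoding="utf-8") as f:
    f.write("/a/x\n/a/y\n/a/x\n")

db_path = os.path.join(workdir, "seen_paths")
with dbm.open(db_path, "c") as seen:  # "c": create the database if missing
    with gzip.open(sample, "rt", encoding="utf-8") as f:
        for line in f:  # one line at a time, never the whole file
            # Keys and values must be bytes; the empty value makes it a set
            seen[line.rstrip("\n").encode("utf-8")] = b""
    unique_paths = sorted(key.decode("utf-8") for key in seen.keys())

print(unique_paths)
```

Which dbm backend you get depends on what is installed on your platform, so performance can vary. The dict-like interface used here is the same across them.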
I am using GAE to run a pet project.
I have a large table (100K rows) that I am running an indexed query against. That seems fine. However, iterating through the results seems to take non-linear time. Doing some profiling, it seems that for the first batch of rows (100 or so) it is acting linearly, but then falls off a cliff and starts taking more and more time for each row retrieved. Here is the code sketch:
q = Metrics.all()
q.filter('Tag =', 'All')
q.order('-created')
iterator = q.run(limit=100)
l = []
for i in iterator:
    l.append(i.created)
Any idea what could cause this to behave non-linearly?
Most likely it is because you are not making use of query cursors; use them and you should see your performance improve.
Also, it looks like you are using the old DB API. Consider switching to NDB, since the newer implementation is supposed to be better and faster.
If you know the exact number you want to process, consider using fetch. run fetches your results in smaller chunks (the default batch size is 20), so it incurs extra round trips.
OOT: maybe good to rename the list variable, it has same name as python list function :)
I am building a flexible, lightweight, in-memory database in Python, and discovered a performance problem with the way I was looking up values and using indexes. In an effort to improve this I've tried a few options, trying to balance speed with memory usage. My current implementation uses a dict of dicts to store data by record (object reference) and field (also an object reference). So for example, if I have three records with three fields, where some of the data is missing (i.e. NULL values):
{<Record1>: {<Field1>: 4, <Field2>: 'value', <Field3>: <Other Record>},
 <Record2>: {<Field1>: 4, <Field2>: 'value'},
 <Record3>: {<Field1>: 5}}
I considered a numpy array, but I would still need two dictionaries to map object instances to array indexes, so I can't see that it will perform any better.
Indexes are implemented using a pair of bisected lists, essentially acting as a map from value to record instance. For example, an index on the above <Field1>:
[[4, 4, 5], [<Record1>, <Record2>, <Record3>]]
I was previously using a simple dict of bins, but this didn't allow range lookups (e.g. all values > 5) (see Python hash table for fuzzy matching).
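For readers who have not seen the pattern, here is a minimal sketch of such a pair of bisected lists using the standard bisect module. The SortedIndex class and its method names are illustrative only and are not taken from the linked code. Strings stand in for record objects.

```python
import bisect

class SortedIndex:
    """Two parallel lists kept sorted by value: values[i] belongs to records[i]."""

    def __init__(self):
        self.values = []
        self.records = []

    def add(self, value, record):
        # Find the slot after any equal values, then insert into both lists
        pos = bisect.bisect_right(self.values, value)
        self.values.insert(pos, value)
        self.records.insert(pos, record)

    def equal_to(self, value):
        lo = bisect.bisect_left(self.values, value)
        hi = bisect.bisect_right(self.values, value)
        return self.records[lo:hi]

    def greater_than(self, value):
        return self.records[bisect.bisect_right(self.values, value):]

index = SortedIndex()
for value, record in [(4, "Record1"), (5, "Record3"), (4, "Record2")]:
    index.add(value, record)

print(index.values, index.records)
print(index.greater_than(4))
```

Lookups are O(log n) thanks to bisection, while each insert is O(n) because list.insert has to shift elements. That is the price paid for supporting range queries.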
My question is this. I am concerned that I have several object references, and multiple copies of the same values in the indexes. Do all these duplicate references actually use more memory, or are references cheap in python? My alternative is to try to associate a numerical key to each object, which might improve things at least up to 256, but I don't know enough about how python handles references to know if this would really be any better.
Does anyone have any suggestions of a better way to manage this?
Reimplementing the critical parts in C is an option I want to keep as a last resort.
For anyone interested, my code is here.
Edit 1:
The question, simply put, is which of the following is more efficient in terms of memory usage, where a is an object instance and i is an integer:
[a] * 1000
Or
[i] * 1000, {a: i}
Edit 2:
Because of the large number of comments suggesting I use an existing system, here are my requirements. If anyone can suggest a system which fulfills all of these, that would be great, but so far I have not found anything which does. Otherwise, my original question still relates to memory usage of references in Python:
Must be light-weight and in-memory. Definitely not a client/server model.
Need to be able to easily alter tables, change fields, change rules, etc, on the fly.
Need to easily apply very complex validation rules. SQL doesn't meet this requirement. Although it is sometimes possible to build up very complicated statements, it is far from easy.
Need to support joins and associations between tables. Many NoSQL databases don't support joins at all, or at most only simple joins.
Need to support a method of loading and storing data to any file format. I am currently implementing this by providing a framework which makes it easy to add new formats as needed.
It does not need persistence (beyond storing data as in the previous point), and does not need to handle massive amounts of data, i.e. not more than a couple of million records. Typically, I am dealing with a few thousand.
Each reference is in effect a pointer, and each pointer requires a small amount of memory.
You can use memory profiler to view memory use on a line by line basis. In this way you can see what happens when you make a reference.
Python does not specify a particular implementation for dynamic memory management, but from the semantics of the language one can assume that a reference uses memory similar to a C pointer.
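You can check this on your own interpreter with sys.getsizeof. The sketch below compares the two options from Edit 1. Record is a placeholder class, and the numbers are specific to CPython, so treat them as an observation rather than a language guarantee.

```python
import sys

class Record:
    """Placeholder for a record object."""

a = Record()
i = 7

refs_to_object = [a] * 1000  # 1000 references to the same instance
refs_to_int = [i] * 1000     # 1000 references to the same int

size_obj = sys.getsizeof(refs_to_object)
size_int = sys.getsizeof(refs_to_int)
mapping_size = sys.getsizeof({a: i})

print(size_obj, size_int, mapping_size)
```

Both lists report the same size, because a list stores only references whatever they point at. On CPython the second option therefore costs the same list plus the extra dictionary.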
FWIW, I ran some tests on a 100x100 structure, testing a sparsely populated dictionary structure, a fully populated dictionary structure, a list, and a numpy array. The latter two had a dictionary mapping object references to indexes. I timed getting every item in the structure by index (returning a sentinel for missing data in the sparse dict), and also reported the total size. My results were somewhat surprising:
Structure      Time      Size
=============  ========  =====
full dict      0.0236s    6284
list           0.0426s   13028
sparse dict    0.1079s    1676
array          0.2262s   12608
So the fastest and second smallest was a full dict, presumably because there was no need to run a key in dict check on it.