Python 3 integers have unlimited precision. In practice, this is limited by a computer's memory.
Consider the following code:
i = 12345
while True:
    i = i * 123
This will obviously fail. But what will the result be? The entire RAM (and the page file) filled with this one integer (except for the space occupied by other processes)?
Or is there a safeguard to catch this before it gets that far?
You can check what happens without risking filling all available memory by setting a memory limit explicitly:
#!/usr/bin/env python
import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit))  # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit))  # restore
with limit(100 * (1 << 20)):  # 100 MiB
    # do the thing that might try to consume all memory
    i = 1
    while True:
        i <<= 1
This code consumes 100% of a single CPU core, and the consumed memory grows very slowly.
In principle, you should get a MemoryError at some point; whether that happens before your computer turns to dust is unclear. CPython uses a contiguous block of memory to store the digits, so you may get the error even while there is still RAM available, if it is fragmented.
Your specific code shouldn't trigger it, but in general you could also get an OverflowError if you try to construct an integer larger than sys.maxsize bytes.
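You can watch the contiguous digit storage grow by inspecting the object size directly with sys.getsizeof (a small illustrative snippet, not from the original answer):

import sys

# On 64-bit CPython builds, each internal "digit" stores 30 bits of the
# value, so the object size grows with the integer's bit length.
for bits in (1, 30, 300, 3000, 30000):
    print(bits, sys.getsizeof(1 << bits))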
Related
I'm bruteforcing an 8-digit PIN on an ELF executable (it's for a CTF) and I'm using asynchronous parallel processing. The code is very fast, but it fills the memory even faster.
It takes about 10% of the total iterations to fill 8 GB of RAM, and I have no idea what's causing it. Any help?
from pwn import *
import multiprocessing as mp
from tqdm import tqdm

def check_pin(pin):
    program = process('elf_exe')
    program.recvn(36)
    program.sendline(str(pin).encode())
    program.recvline()
    program.recvline()
    res = program.recvline()
    program.close()
    if 'Access denied.' in str(res):
        return None, None
    else:
        return res, pin

def process_result(result):
    res, pin = result  # the callback receives the return value as one argument
    if res is not None:
        print(pin)

if __name__ == '__main__':
    print(f'Starting bruteforce on {mp.cpu_count()} cores :)\n')
    pool = mp.Pool(mp.cpu_count())
    min_pin = 10000000
    max_pin = 99999999
    for pin in tqdm(range(min_pin, max_pin)):
        pool.apply_async(check_pin, args=(pin,), callback=process_result)
    pool.close()
    pool.join()
Multiprocessing pools create several processes. Calls to apply_async create a task that is added to a shared data structure (e.g. a queue). That data structure is read by the worker processes via inter-process communication (IPC). The problem is that apply_async returns a synchronization object that you never use, so there is no synchronization at all. Each item appended to the data structure takes some memory (at least 32*3 = 96 bytes, since three CPython objects are allocated per task), and the structure grows to hold all 89_999_999 items, hence at least 8 GiB of RAM. The workers are simply not fast enough to keep up with the submissions. What tqdm prints is completely misleading: it shows only the number of tasks submitted, not the number executed, which is a tiny fraction of that. Almost all of the work still remains when tqdm prints 100% and the submission loop finishes. I also doubt the "code is very fast" claim, since it appears to spawn 90 million processes, and spawning a process is known to be an expensive operation.
To speed up this code and avoid the big memory usage, you need to aggregate the work into bigger tasks. You can, for example, pass a range of PINs to each task and add a loop over that range in check_pin. A reasonable range size is, for example, 1000. Additionally, you need to accumulate the AsyncResult objects returned by apply_async in a list and perform periodic synchronization when the list becomes too big, so that the workers are never flooded and the shared data structure stays small. Here is a simple untested example:
lst = []
for rng in allRanges:
    lst.append(pool.apply_async(check_pin, args=(rng,), callback=process_result))
    if len(lst) > 100:
        # Naive synchronization
        for i in lst:
            i.wait()
        lst = []
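For completeness, the modified check_pin that takes a range instead of a single PIN might look like this hedged, untested sketch; it reuses the question's pwntools calls, and the construction of allRanges is my illustrative assumption:

def check_pin(rng):
    # try every PIN in the range; return the first hit, else (None, None)
    for pin in rng:
        program = process('elf_exe')
        program.recvn(36)
        program.sendline(str(pin).encode())
        program.recvline()
        program.recvline()
        res = program.recvline()
        program.close()
        if 'Access denied.' not in str(res):
            return res, pin
    return None, None

# batches of 1000 PINs each
allRanges = [range(start, min(start + 1000, 100000000))
             for start in range(10000000, 100000000, 1000)]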
I am observing memory usage that I cannot explain to myself. Below I provide a stripped down version of my actual code that still exhibits this behavior. The code is intended to accomplish the following:
Read a text file in chunks of 1000 lines. Each line is a sentence. Split these 1000 sentences into 4 generators. Pass these generators to a multiprocessing pool and run feature extraction in parallel on 250 sentences each.
In my actual code I accumulate features and labels from all sentences of the entire file.
Now here comes the weird thing: memory gets allocated but not freed again, even when I am not accumulating these values! And I think it has something to do with the pool. The total amount of memory taken depends on how many features are extracted for any given word. I simulate this here with range(100). Have a look:
from sys import argv
from itertools import chain, islice
from multiprocessing import Pool
from math import ceil

# dummyfied feature extraction function
# the length of the range determines how much memory is used up in total,
# even though the objects are never stored
def features_from_sentence(sentence):
    return [{'some feature': 'some value'} for i in range(100)], ['some label' for i in range(100)]

# split iterable into generator of generators of length `size`
def chunks(iterable, size=10):
    iterator = iter(iterable)
    for first in iterator:
        yield chain([first], islice(iterator, size - 1))

def features_from_sentence_meta(l):
    return list(map(features_from_sentence, l))

def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    pool = Pool()
    # split sentences into a generator of 4 generators
    sentence_chunks = chunks(sentences, ceil(50000/4))
    # results is a list containing the lists of pairs of X and Y of all chunks
    results = map(lambda x: x[0], pool.map(features_from_sentence_meta, sentence_chunks))
    X, Y = zip(*results)
    print(f'end: {i}')
    return X, Y

# reads file in chunks of `lines_per_chunk` lines
def line_chunks(textfile, lines_per_chunk=1000):
    chunk = []
    i = 0
    with open(textfile, 'r') as textfile:
        for line in textfile:
            if not line.split(): continue
            i += 1
            chunk.append(line.strip())
            if i == lines_per_chunk:
                yield chunk
                i = 0
                chunk = []
        yield chunk

textfile = argv[1]

for i, line_chunk in enumerate(line_chunks(textfile)):
    # stop processing file after 10 chunks to demonstrate
    # that memory stays occupied (check your system monitor)
    if i == 10:
        while True:
            pass
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
The file I am using to debug this has 50000 nonempty lines, which is why I use the hardcoded 50000 in one place. If you want to use the same file, here is a link for your convenience:
https://www.dropbox.com/s/v7nxb7vrrjim349/de_wiki_50000_lines?dl=0
Now when you run this script and open your system monitor, you will observe that memory gets used up and the usage keeps growing until the 10th chunk, where I artificially go into an endless loop to demonstrate that the memory stays in use, even though I never store anything.
Can you explain to me why this happens? I seem to be missing something about how multiprocessing pools are supposed to be used.
First, let's clear up some misunderstandings—although, as it turns out, this wasn't actually the right avenue to explore in the first place.
When you allocate memory in Python, of course it has to go get that memory from the OS.
When you release memory, however, it rarely gets returned to the OS, until you finally exit. Instead, it goes into a "free list"—or, actually, multiple levels of free lists for different purposes. This means that the next time you need memory, Python already has it lying around, and can find it immediately, without needing to talk to the OS to allocate more. This usually makes memory-intensive programs much faster.
But this also means that—especially on modern 64-bit operating systems—trying to understand whether you really do have any memory pressure issues by looking at your Activity Monitor/Task Manager/etc. is next to useless.
The tracemalloc module in the standard library provides low-level tools to see what actually is going on with your memory usage. At a higher level, you can use something like memory_profiler, which (if you enable tracemalloc support—this is important) can put that information together with OS-level information from sources like psutil to figure out where things are going.
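For example, a minimal tracemalloc session looks something like this (an illustrative sketch, not from the original answer):

import tracemalloc

tracemalloc.start()
# ... run the code you suspect of over-allocating ...
snapshot = tracemalloc.take_snapshot()
# print the ten source lines responsible for the most allocated memory
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)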
However, if you aren't seeing any actual problems—your system isn't going into swap hell, you aren't getting any MemoryError exceptions, your performance isn't hitting some weird cliff where it scales linearly up to N and then suddenly goes all to hell at N+1, etc.—you usually don't need to bother with any of this in the first place.
If you do discover a problem, then, fortunately, you're already half-way to solving it. As I mentioned at the top, most memory that you allocated doesn't get returned to the OS until you finally exit. But if all of your memory usage is happening in child processes, and those child processes have no state, you can make them exit and restart whenever you want.
Of course there's a performance cost to doing so—process teardown and startup time, and page maps and caches that have to start over, and asking the OS to allocate the memory again, and so on. And there's also a complexity cost—you can't just run a pool and let it do its thing; you have to get involved in its thing and make it recycle processes for you.
The multiprocessing.Pool class does have builtin support for this: the maxtasksperchild argument makes each worker process exit and get replaced by a fresh one after completing a given number of tasks.
If you need more control than that, you can build your own Pool. If you want to get fancy, you can look at the source to multiprocessing and do what it does. Or you can build a trivial pool out of a list of Process objects and a pair of Queues, as in the sketch below. Or you can just directly use Process objects without the abstraction of a pool.
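A minimal sketch of such a trivial pool, under the assumption that tasks are (function, argument) pairs of picklable module-level objects:

from multiprocessing import Process, Queue

def worker(tasks, results):
    # run until the None sentinel arrives, then exit (returning all memory to the OS)
    for func, arg in iter(tasks.get, None):
        results.put(func(arg))

def make_pool(n=4):
    tasks, results = Queue(), Queue()
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(n)]
    for w in workers:
        w.start()
    return tasks, results, workers

# to recycle the workers: put one None per worker, join them, call make_pool again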
Another reason you can have memory problems is that your individual processes are fine, but you just have too many of them.
And, in fact, that seems to be the case here.
You create a Pool of 4 workers in this function:
def make_X_and_Y_sets(sentences, i):
    print(f'start: {i}')
    pool = Pool()
    # ...
… and you call this function for every chunk:
for i, line_chunk in enumerate(line_chunks(textfile)):
    # ...
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
So, you end up with 4 new processes for every chunk. Even if each one has pretty low memory usage, having hundreds of them at once is going to add up.
Not to mention that you're probably severely hurting your time performance by having hundreds of processes competing over 4 cores, so you waste time in context switching and OS scheduling instead of doing real work.
As you pointed out in a comment, the fix for this is trivial: just make a single global pool instead of a new one for each call (a combined sketch appears at the end of this answer).
Sorry for getting all Columbo here, but… just one more thing… This code runs at the top level of your module:
for i, line_chunk in enumerate(line_chunks(textfile)):
    # ...
    X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i)
… and that's the code that tries to spin up the pool and all the child tasks. But each child process in that pool needs to import this module, which means they're all going to end up running the same code, and spinning up another pool and a whole extra set of child tasks.
You're presumably running this on Linux or macOS, where the default startmethod is fork, which means multiprocessing can avoid this import, so you don't have a problem. But with the other startmethods, this code would basically be a forkbomb that eats up all of your system resources. And that includes spawn, which is the default startmethod on Windows. So, if there's ever any chance anyone might run this code on Windows, you should put all of that top-level code in an if __name__ == '__main__': guard, as in the sketch below.
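Putting both fixes together, the tail of the module might look like this hedged sketch; passing the pool into make_X_and_Y_sets as an extra parameter is my assumption about how you'd wire it up:

from multiprocessing import Pool

def main():
    textfile = argv[1]
    with Pool() as pool:  # one pool, reused for every chunk
        for i, line_chunk in enumerate(line_chunks(textfile)):
            X_chunk, Y_chunk = make_X_and_Y_sets(line_chunk, i, pool)

if __name__ == '__main__':
    main()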
I have written an application with Flask that uses Celery for a long-running task. While load testing I noticed that the Celery tasks are not releasing memory even after completing the task. So I googled and found this group discussion:
https://groups.google.com/forum/#!topic/celery-users/jVc3I3kPtlw
In that discussion it says that's how Python works.
Also the article at https://hbfs.wordpress.com/2013/01/08/python-memory-management-part-ii/ says
"But from the OS’s perspective, your program’s size is the total (maximum) memory allocated to Python. Since Python returns memory to the OS on the heap (that allocates other objects than small objects) only on Windows, if you run on Linux, you can only see the total memory used by your program increase."
And I use Linux. So I wrote the below script to verify it.
import gc

def memory_usage_psutil():
    # print the memory usage in MB
    import resource
    print('Memory usage: %s (MB)' % (resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1000.0))

def fileopen(fname):
    memory_usage_psutil()  # 10 MB
    f = open(fname)
    memory_usage_psutil()  # 10 MB
    content = f.read()
    memory_usage_psutil()  # 14 MB

def fun(fname):
    memory_usage_psutil()  # 10 MB
    fileopen(fname)
    gc.collect()
    memory_usage_psutil()  # 14 MB

import sys
from time import sleep

if __name__ == '__main__':
    fun(sys.argv[1])
    for _ in range(60):
        gc.collect()
        memory_usage_psutil()  # 14 MB ...
        sleep(1)
The input was a 4 MB file. Even after returning from the fileopen function, the 4 MB of memory was not released. I checked the htop output while the loop was running; the resident memory stays at 14 MB. So unless the process is stopped, the memory stays with it.
So if the Celery worker is not killed after its task is finished, it is going to keep the memory for itself. I know I can use the max_tasks_per_child config value to kill the process and spawn a new one. Is there any other way to return the memory to the OS from a Python process?
I think your measurement method and interpretation are a bit off. You are using ru_maxrss of resource.getrusage, which is the "high watermark" of the process. See this discussion for details on what that means. In short, it is the peak RAM usage of your process, but not necessarily the current usage. Parts of the process could be swapped out, etc.
It can also mean that the process has freed that 4 MiB, but the OS has not reclaimed the memory, because it's faster for the process to allocate a new 4 MiB if it already has the memory mapped. To make it even more complicated, programs can and do use "free lists", lists of blocks of memory that are not in active use but are not freed either. This is also a common trick to make future allocations faster.
I wrote a short script to demonstrate the difference between virtual memory usage and max RSS:
import numpy as np
import psutil
import resource

def print_mem():
    print("----------")
    print("ru_maxrss: {:.2f}MiB".format(
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024))
    print("virtual_memory.used: {:.2f}MiB".format(
        psutil.virtual_memory().used / 1024 ** 2))

print_mem()

print("allocating large array (80e6,)...")
a = np.random.random(int(80e6))
print_mem()

print("del a")
del a
print_mem()

print("read testdata.bin (~400MiB)")
with open('testdata.bin', 'rb') as f:
    data = f.read()
print_mem()

print("del data")
del data
print_mem()
The results are:
----------
ru_maxrss: 22.89MiB
virtual_memory.used: 8125.66MiB
allocating large array (80e6,)...
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8731.85MiB
del a
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8121.66MiB
read testdata.bin (~400MiB)
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8513.11MiB
del data
----------
ru_maxrss: 633.20MiB
virtual_memory.used: 8123.22MiB
It is clear how ru_maxrss remembers the maximum RSS, while the current usage has dropped by the end.
Note on psutil.virtual_memory().used:
used: memory used, calculated differently depending on the platform and designed for informational purposes only.
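If you want the current (not peak) footprint of just your process, rather than the system-wide number above, a hedged alternative is psutil's per-process API (an illustrative snippet, not from the original answer):

import os
import psutil

# resident set size of this process right now, in MiB
proc = psutil.Process(os.getpid())
print("rss: {:.2f} MiB".format(proc.memory_info().rss / 1024 ** 2))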
So basically I have a list of 300 values and different averages associated with each one.
I have a for-loop that generates a list of ten of these values at random, and writes it to excel if certain conditions are met based on their averages.
The code runs fine if I loop through 10 million times or less, but that is orders of magnitude too small. Even if I just double the for-loop counter to 20 million, my computer becomes unusable while it is running.
I want to iterate the loop 100 million or 1 billion times even. I want it to run slowly in the background, I don't care if it takes 24 hours to get to the results. I just want to use my computer while it's working. Currently, if the for loop goes past 10 million the memory and disk usage of my laptop go to 99%.
Using pyScripter and python 3.3
Comp specs: Intel Core i7 4700HQ (2.40GHz) 8GB Memory 1TB HDD NVIDIA GeForce GTX 850M 2GB GDDR3
Code snippet:
for i in range(0, cycles):
    genRandLineups(Red)  # random team gens
    genRandLineups(Blue)
    genRandLineups(Purple)
    genRandLineups(Green)
    if (sum(teamAve[i]) <= 600
            and (sum(teamValues[i]) > currentHighScore
                 or sum(teamValues[i]) > 1024)):
        teamValuesF.append(teamValues[i])
        sheetw.write(q, 0, str(teamValues[i]))
        ts = time.time()
        workbookw.save("Data_Log.xls")
        st = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')
        sheetw.write(q, 3, st)
        q = q + 1
    if sum(teamValues[i]) > currentHighScore:
        currentHighScore = sum(teamValues[i])
First, I suspect your real problem is that you're just retaining too much memory, causing your computer to run into VM swap, which makes your entire computer slow to a crawl. You should really look into fixing that instead of just trying to make it happen periodically throughout the day instead of constantly.
In particular, it sounds like you're keeping a list of 10N values around forever. Do you really need to do that?
If not, start freeing them. (Or don't store them in the first place. One common problem a lot of people have is that they need 1 billion values, but only one at a time, once through a loop, and they're storing them in a list when they could be using an iterator. This is basically the generic version of the familiar readlines() problem.)
If so, look into some efficient disk-based storage instead of memory, or something more compact like a NumPy array instead of a list.
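For instance, a hedged sketch of the NumPy option; the 100-million-value figure is illustrative:

import numpy as np

# one contiguous ~800 MB block for 100 million float64 values, versus
# several GB for the equivalent list of boxed Python float objects
values = np.empty(100_000_000, dtype=np.float64)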
But meanwhile, if you want to reduce the priority of a program, the easiest way to do that may be externally. For example, on most platforms besides Windows, you can just launch your script with nice -n 19 python myscript.py and the OS will give everything else more CPU time than your program.
But to answer your direct question, if you want to slow down your script from inside, that's pretty easy to do: Just call sleep every so often. This asks the OS to suspend your program and not give you any resources until the specified number of seconds have expired. That may be only approximate rather than absolutely nothing for exactly N seconds, but it's close enough (and as good as you can do).
For example:
for i in range(reps):
    do_expensive_work()
    if i % 100 == 99:
        time.sleep(10)
If do_expensive_work takes 18ms, you'll burn CPU for 1.8 seconds then sleep for 10 and repeat. I doubt that's exactly the behavior you want (or that it takes 18ms), but you can tweak the numbers. Or, if the timing is variable, and you want the sleep percentage to be consistent, you can measure times and sleep every N seconds since the last sleep, instead of every N reps.
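A hedged sketch of that time-based variant, with reps and do_expensive_work as in the example above, and a roughly 2s-work/10s-sleep duty cycle picked arbitrarily:

import time

last_sleep = time.monotonic()
for i in range(reps):
    do_expensive_work()
    if time.monotonic() - last_sleep >= 2.0:  # ~2s of work since last sleep
        time.sleep(10)                        # then yield the CPU for 10s
        last_sleep = time.monotonic()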
Do not slow down. Rather, re-design and step in for HPC.
For high-performance processing on just "a few" items (a list of 300 values),
the best approach would consist of:
avoid file access (even if sparse, as noted in the OP) -- cache true positives in an output string that is written out at the end, once a string-length limit is reached, or marshalled to another, remote, logging machine.
move all highly iterative processing, said to span 10E+006 -- 10E+009 cycles, onto massively-parallel GPU/CUDA kernel-processing on the GPU you already have in the laptop, both to free your CPU resources and to gain the benefit of its 640 threads delivering about 1.15 TFLOPS of parallel computing horsepower, as opposed to just a few GUI-shared MFLOPS from the CPU cores.
I am monitoring a file in Python and triggering an action when it reaches a certain size. Right now I am sleeping and polling but I'm sure there is a more elegant way to do this:
from os import stat
from time import sleep

POLLING_PERIOD = 10
SIZE_LIMIT = 1 * 1024 * 1024

while True:
    sleep(POLLING_PERIOD)
    if stat(file).st_size >= SIZE_LIMIT:
        ...  # do something
The thing is, if I have a big POLLING_PERIOD, the size check is inaccurate for files that grow quickly, and if I have a small POLLING_PERIOD, I am wasting CPU.
How can I do this?
Thanks!
Linux Solution
You want to look at using pyinotify, a Python binding for inotify.
Here is an example of watching for close events; it isn't a big jump to listening for size changes.
#!/usr/bin/env python
import os, sys
from pyinotify import WatchManager, Notifier, ProcessEvent, EventsCodes

def Monitor(path):
    class PClose(ProcessEvent):
        def process_IN_CLOSE(self, event):
            f = event.name and os.path.join(event.path, event.name) or event.path
            print('close event: ' + f)

    wm = WatchManager()
    notifier = Notifier(wm, PClose())
    wm.add_watch(path, EventsCodes.IN_CLOSE_WRITE | EventsCodes.IN_CLOSE_NOWRITE)

    try:
        while 1:
            notifier.process_events()
            if notifier.check_events():
                notifier.read_events()
    except KeyboardInterrupt:
        notifier.stop()
        return

if __name__ == '__main__':
    try:
        path = sys.argv[1]
    except IndexError:
        print('use: %s dir' % sys.argv[0])
    else:
        Monitor(path)
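Adapting it to the size check might look like this hedged, untested sketch; it uses pyinotify's module-level flag constants rather than the older EventsCodes style, and SIZE_LIMIT comes from the question:

import os
from pyinotify import WatchManager, Notifier, ProcessEvent, IN_MODIFY

SIZE_LIMIT = 1 * 1024 * 1024

class PSize(ProcessEvent):
    def process_IN_MODIFY(self, event):
        # IN_MODIFY fires on every write, so check the size each time
        if os.stat(event.pathname).st_size >= SIZE_LIMIT:
            print('size limit reached: ' + event.pathname)

wm = WatchManager()
notifier = Notifier(wm, PSize())
wm.add_watch('/path/to/watch', IN_MODIFY)
notifier.loop()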
Windows Solution
pywin32 has bindings for file system notifications for the Windows file system.
What you want is to use FindFirstChangeNotification, tie into that, and listen for FILE_NOTIFY_CHANGE_SIZE. This example listens for file-name changes; it isn't a big leap to listen for size changes.
import os
import win32file
import win32event
import win32con

path_to_watch = os.path.abspath(".")

#
# FindFirstChangeNotification sets up a handle for watching
# file changes. The first parameter is the path to be
# watched; the second is a boolean indicating whether the
# directories underneath the one specified are to be watched;
# the third is a list of flags as to what kind of changes to
# watch for. We're just looking at file additions / deletions.
#
change_handle = win32file.FindFirstChangeNotification(
    path_to_watch,
    0,
    win32con.FILE_NOTIFY_CHANGE_FILE_NAME
)

#
# Loop forever, listing any file changes. The WaitFor... will
# time out every half a second allowing for keyboard interrupts
# to terminate the loop.
#
try:
    old_path_contents = dict([(f, None) for f in os.listdir(path_to_watch)])
    while 1:
        result = win32event.WaitForSingleObject(change_handle, 500)

        #
        # If the WaitFor... returned because of a notification (as
        # opposed to timing out or some error) then look for the
        # changes in the directory contents.
        #
        if result == win32con.WAIT_OBJECT_0:
            new_path_contents = dict([(f, None) for f in os.listdir(path_to_watch)])
            added = [f for f in new_path_contents if f not in old_path_contents]
            deleted = [f for f in old_path_contents if f not in new_path_contents]
            if added: print("Added: ", ", ".join(added))
            if deleted: print("Deleted: ", ", ".join(deleted))
            old_path_contents = new_path_contents
            win32file.FindNextChangeNotification(change_handle)
finally:
    win32file.FindCloseChangeNotification(change_handle)
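To watch for size changes instead of renames, only the flag needs to change (a hedged sketch of just the modified call):

change_handle = win32file.FindFirstChangeNotification(
    path_to_watch,
    0,
    win32con.FILE_NOTIFY_CHANGE_SIZE
)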
OSX Solution
There are equivalent hooks into the OSX file system using PyKQueue as well; if you can follow these examples, you can Google for the OSX solution too.
Here is a good article about Cross Platform File System Monitoring.
You're correct: "Polling is Evil". The more often you poll, the more CPU you waste if nothing happened. If you poll less frequently, you delay handling the event when it does occur.
The only alternative, however, is to "block" until you receive some kind of "signal".
If you're on Linux, you can use "inotify":
http://linux.die.net/man/7/inotify
You're right that polling is generally a sub-optimal solution compared to other ways of accomplishing something. However, sometimes it's the simplest solution, especially if you are trying to write something that will work on both Windows and Linux/UNIX.
Fortunately, modern hardware is quite fast. On my machine, I was able to run your loop, polling ten times a second, without any visible impact on the CPU usage in the Windows Task Manager. 100 times a second did produce some tiny humps on the usage graph and the CPU usage would occasionally reach 1%.
Ten seconds between polling, as in your example, is utterly trivial in terms of CPU usage.
You can also give your script a lower priority to make sure it doesn't affect the performance of other tasks on the machine, if that's important.
A bit of a kludge, but given the CPU-load vs. size-accuracy conundrum, would a variable polling period be appropriate?
In a nutshell, the polling period would decrease as the file size approaches the limit and/or as the current growth rate rises above a certain level. In this fashion, at least most of the time, the CPU cost of the polling would be somewhat reduced.
For example, until the current file size reaches a certain threshold (or one of several tiers of thresholds), you'd allow for a longer period.
Another heuristic: if the file size didn't change since last time, indicating no current activity on the file, you could further lengthen the period by a fraction of a second. (Or comparing the file's timestamp with the current time may do.)
The specific parameters of the function that determines the polling period in a given context will depend on the particulars of how the file growth takes place in general; a sketch follows the list of questions below.
What is the absolute minimal amount of time for the file to grow to say 10% of the size limit?
What is the average amount of data written to these files in 24 hours?
Is there much difference in the file growth rate at different times of day and night ?
What is the frequency at which the file size changes?
Is it a "continuous" logging-like activity, whereby on average several dozen of writes take place each second, or are the writes taking place less frequently but putting much more data at once when they do?
etc.
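A hedged sketch of such an adaptive poller; the backoff constants and the linear size-to-period mapping are arbitrary illustrative choices:

import time
from os import stat

def adaptive_poll(path, size_limit, min_period=0.5, max_period=10.0):
    last_size = -1
    while True:
        size = stat(path).st_size
        if size >= size_limit:
            return size
        # the closer to the limit, the shorter the polling period
        period = max_period * (1.0 - size / size_limit)
        if size == last_size:
            # no activity since last poll: lengthen the period a bit
            period = min(period + 0.5, max_period)
        last_size = size
        time.sleep(max(min_period, period))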
Alternatively, but at the risk of introducing OS-specific logic in your program, you can look into file/directory change-notification systems. I know that both the WIN32 API and Linux offer their flavor of this kind of feature. I'm unaware of the specific implementations the OSes use, and these may well rely internally on some kind of polling similar to what you have, but the OS has various hooks into the file system that may allow a much less intrusive implementation.