I have multiprocessing code wherein each process does a disk write (pickling data), and the resulting pickle files can be upwards of 50 MB (and sometimes even more than 1 GB depending on what I'm doing). Also, different processes are not writing to the same file, each process writes a separate file (or set of files).
Would it be a good idea to implement a lock around disk writes so that only one process is writing to the disk at a time? Or would it be best to just let the operating system sort it out even if that means 4 processes may be trying to write 1 GB to the disk at the same time?
As long as the processes aren't fighting over the same file; let the OS sort it out. That's its job.
Unless your processes try and dump their data in one big write, the OS is in a better position to schedule disk writes.
If you do use one big write, you mighy try and partition it in smaller chunks. That might give the OS a better chance of handling them.
Of course you will hit a limit somewhere. Your program might be the CPU-bound, memory-bound or disk-bound. It might hit different limits depending on the input or load.
But unless you've got evidence that you're constantly disk-bound and you've got a good idea how to solve that, I'd say don't bother. Because the days that a write system call actuall meant that the data was directly sent to disk are long gone.
Most operating systems these days use unallocated RAM as a disk cache. And HDD's have built-in caches as well. Unless you disable both of these (which will give you a huge performance hit) there is precious little connection between your program completing a write and and the data actually hitting the plates or flash.
You might consider using memmap (if your OS supports it), and let the OS's virtual memory do the work for you. See e.g. the architect notes for the Varnish cache.
Related
I work for a digital marketing agency having multiple clients. And in one of the projects, I have a very resource intensive python script (which fetches data for Facebook ads), to be run on all those clients (say 500+ in number) in ubuntu 16.04 server.
Originally script took around 2 mins to complete, with 300 MB RES & 1000 MB VM (as per htop), for 1 client. Hence optimized it with ThreadPoolExecutor (max_workers=10) so that script can run on 4 clients concurrently (almost).
Then found out that sometimes, script froze during run (or basically its in "comatose state"). Debugged & profiled and found that its not the script that's causing issue, but its the system.
Then batched the script, means if there are 20 input clients, ran 5 instances (4*5=20) of script. Here sometimes it went fine but sometimes last instance froze.
Then found out that RAM (2G) was being overused, hence increased swapping memory from 0 to 1G. That did the trick. But if few clients are heavy in memory, same thing happens.
Have attached the screenshot of the latest run where after running the 8 instances, last 2 froze. They can be left for days for that matter.
I am thinking of increasing the server RAM from 2G to 4G but not sure if that's the permanent solution. Did anyone has faced similar issue?
You need to fix the Ram consumption of your script,
if your script allocates more memory than your system can provide it get's memory errors, in case you have them in threadpools or similar constructs the threads may never return under some circumstances.
You can fix this by using async functions with timeouts and implementing automatic restart handlers, in case a process does not yield an expected results.
The best way to do that is heavily dependent on the script and will probably require altering already created code
The issue is definitly with your script and not with the OS.
The fastetst workaround would be to increase she system memory or to reduce the amount of threads.
If just adding 1GB of swap area "almost" did the trick then definitely increasing the physical memory is a good way to go. Btw remember that swapping means you're using disk storage, whose speed is measured in millisecs, while RAM speed is measured in nanosecs - so avoiding swap guarantees a performance boost.
And then, reboot your system every now and then. Although Linux is far better than Windows in this respect, memory leaks do occur in Linux too, and a reboot every few months will surely help.
As Gornoka stated you need to alter the memory comsumption of the script as added details this can also be done by removing declared variables within the script once used with the keyword
del
This can also be done by ensuring that if it is processing massive files it does this line by line and saving it as it finishes each line.
I have had this happen and it usually is an indicator of working with to much data at once within the ram and it is always better to work with it partially whenever possible and if not possible get more RAM
I have one very large custom data structure (similar to a trie, though it's not important to the question) that I'm using to access and serve data from. I'm moving my application to uWSGI for production use now, and I definitely don't want this reloaded per worker. Can I share it among worker processes somehow? I just load the structure once and then reload it once a minute through apscheduler. Nothing any of the workers do modify the data structure in any way. Is there another better solution to this type of problem? Loading the same thing per worker is hugely wasteful.
Depending on the kind of data structure it is, you could try using a memory mapped file. There is a Python library that wraps the relevant system calls.
The file's structure would need to reflect the data structure you are using. For example, if you need a trie, you could store all of the strings in a sorted list and do a binary search for the prefix to see which strings have that prefix.
As you access pages in the file, they will be loaded into memory via the OS's disk read cache. Subsequent requests for the same page will be fast. Because the disk cache can be shared between processes, all of your UWSGI workers will benefit from the speed of accessing cached pages.
I tried this on Linux by forcing it to scan a big file in two separate processes. Create a large file called 'big', then run the following in two separate Python processes:
import mmap
with open('big') as fp:
map = mmap.mmap(fp.fileno(), 0, mmap.MAP_PRIVATE)
if x == 'a': # Make sure 'a' doesn't occur in the file!
break
You will notice that the resident memory of the two processes grows as they scan the file, however, so does the shared memory usage. For example, if big is a 1 gb file, both processes will appear to be using about 1 gb of memory. However, the overall memory load on the system will be increased by only 1 gb, not 2 gb.
Obviously there are some limitations to this approach, chiefly that the data structure you are looking to share is easily represented in a binary format. Also, Python needs to copy any bytes from the file into memory whenever you access them. This can cause aggressive garbage collection if you frequently read through the entire file in small pieces, or undermine the shared memory benefit of the memory map if you read large pieces.
I have a simple string matching script that tests just fine for multiprocessing with up to 8 Pool workers on my local mac with 4 cores. However, the same script on an AWS c1.xlarge with 8 cores generally kills all but 2 workers, the CPU only works at 25%, and after a few rounds stops with MemoryError.
I'm not too familiar with server configuration, so I'm wondering if there are any settings to tweak?
The pool implementation looks as follows, but doesn't seem to be the issue as it works locally. There would be several thousand targets per worker, and it doesn't run past the first five or so. Happy to share more of the code if necessary.
pool = Pool(processes = numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
pool.map_async(runMatch, itertools.izip(itertools.repeat(targetsPerBatch), xrange(0, totalTargets, targetsPerBatch))).get(99999999)
pool.close()
pool.join()
The MemoryError means you're running out of system-wide virtual memory. How much virtual memory you have is an abstract thing, based on the actual physical RAM plus swapfile size plus stuff that's paged into memory from other files and stuff that isn't paged anywhere because the OS is being clever and so on.
According to your comments, each process averages 0.75GB of real memory, and 4GB of virtual memory. So, your total VM usage is 32GB.
One common reason for this is that each process might peak at 4GB, but spend almost all of its time using a lot less than that. Python rarely releases memory to the OS; it'll just get paged out.
Anyway, 6GB of real memory is no problem on an 8GB Mac or a 7GB c1.xlarge instance.
And 32GB of VM is no problem on a Mac. A typical OS X system has virtually unlimited VM size—if you actually try to use all of it, it'll start creating more swap space automatically, paging like mad, and slowing your system to a crawl and/or running out of disk space, but that isn't going to affect you in this case.
But 32GB of VM is likely to be a problem on linux. A typical linux system has fixed-size swap, and doesn't let you push the VM beyond what it can handle. (It has a different trick that avoids creating probably-unnecessary pages in the first place… but once you've created the pages, you have to have room for them.) I'm not sure what an xlarge comes configured for, but the swapon tool will tell you how much swap you've got (and how much you're using).
Anyway, the easy solution is to create and enable an extra 32GB swapfile on your xlarge.
However, a better solution would be to reduce your VM use. Often each subprocess is doing a whole lot of setup work that creates intermediate data that's never needed again; you can use multiprocessing to push that setup into different processes that quit as soon as they're done, freeing up the VM. Or maybe you can find a way to do the processing more lazily, to avoid needing all that intermediate data in the first place.
I am using ubuntu. I have some management commands which when run, does lots of database manipulations, so it takes nearly 15min.
My system monitor shows that my system has 4 cpu's and 6GB RAM. But, this process is not utilising all the cpu's . I think it is using only one of the cpus and that too very less ram. I think, if I am able to make it to use all the cpu's and most of the ram, then the process will be completed in very less time.
I tried renice , by settings priority to -18 (means very high) but still the speed is less.
Details:
its a python script with loop count of nearly 10,000 and that too nearly ten such loops. In every loop, it saves to postgres database.
If you are looking to make this application run across multiple cpu's then there are a number of things you can try depending on your setup.
The most obvious thing that comes to mind is making the application make use of threads and multiprocesses. This will allow the application to "do more" at once. Obviously the issue you might have here is concurrent database access so you might need to use transactions (at which point you might loose the advantage of using multiprocesses in the first place).
Secondly, make sure you are not opening and closing lots of database connections, ensure your application can hold the connection open for as long as it needs.
thirdly, Ensure the database is correctly indexed. If you are doing searches on large strings then things are going to be slow.
Fourthly, Do everything you can in SQL leaving little manipulation to python, sql is horrendously quick at doing data manipulation if you let it. As soon as you start taking data out of the database and into code then things are going to slow down big time.
Fifthly, make use of stored procedures which can be cached and optimized internally within the database. These can be a lot quicker than application built queries which cannot be optimized as easily.
Sixthly, dont save on each iteration of a program. Try to produce a batch styled job whereby you alter a number of records then save all of those in one batch job. This will reduce the amount of IO on each iteration and speed up the process massivly.
Django does support the use of a bulk update method, there was also a question on stackoverflow a while back about saving multiple django objects at once.
Saving many Django objects with one big INSERT statement
Django: save multiple object signal once
Just in case, did you run the command renice -20 -p {pid} instead of renice --20 -p {pid}? In the first case it will be given the lowest priority.
I need to know what mechanism is more efficient (less RAM/CPU) to read and write files, especially write. Possibly a JSON data structure. The idea is perform these operations in a context of WebSockets (client -> server -> read/write file with data of the actual session -> response to client).... Best idea is to store the data in temporal variables and destroy vars when are not useful?
Any idea?
They will probably both be about the same. I/O is generally a lot slower than CPU, so the entire process of reading and writing files will depend on how fast your disk can handle the requests.
It also will depend on the data-processing approach you take. If you opt to read the whole file in at once, then of course it will use more memory than if you choose to read the file piece-by-piece.
So, the answer is: the performance will only (very minimally) depend on your choice of language. Choice of algorithm and I/O performance will easily account for the majority of CPU or RAM usage.