I have a Python script that analyzes user behavior from log files.
This script reads several large files (about 50 GB each) using file.readlines(), analyzes them line by line, and saves the results in a dict of Python objects. After all lines are analyzed, the dict is written to disk.
As I have a server with 64 cores and 96 GB of memory, I start 10 processes of this script, each of which handles part of the data. Besides, to save time spent on I/O, I use file.readlines(MAX_READ_LIMIT) instead of file.readline() and set MAX_READ_LIMIT = 1 GB.
After running this script on the server and using the top command to show resource usage, I find that although each process of my script occupies only about 3.5 GB of memory (40 GB in total), there is only 380 MB left on the server (no other significant memory-consuming app is running on the server at the same time).
So I was wondering: where is the memory? Shouldn't there be about 96 - 40 = 56 GB of memory left?
Please tell me if I have made any mistakes in the above observations.
One hypothesis is that unused memory is NOT returned to the memory pool immediately, so I was wondering how to release unused memory explicitly and immediately.
I learned from the Python documentation that there are two complementary methods to manage memory in Python, garbage collection and reference counting, and according to the Python docs:
Since the collector supplements the reference counting already used in
Python, you can disable the collector if you are sure your program
does not create reference cycles.
So, which one should I use in my case: del obj or gc.collect()?
using file.readlines(), then analyze data line by line
This is a bad design. readlines reads the entire file and returns a Python list of strings. If you only need to process the data line-by-line, then iterate through the file without using readlines:
with open(filename) as f:
    for line in f:
        pass  # process line
This will massively reduce the amount of memory your program requires.
I am using Python 3.3 to read a directory that has 10 files of 20 MB each. I use a thread pool executor with a maximum of 10 threads and submit the files to be read. I read a chunk of 1 MB at a time and then store each line from all the files in a thread-safe list. When I look at the top command, the CPU utilization is pretty high, approximately 100% or above. Any suggestions to reduce the CPU utilization? Below is the snippet.
import concurrent.futures
import time

all_lines_list = []

# Defined before the loop so it exists when the executor calls it.
def trigger(filename):
    with open(filename, "r") as fp:
        buff = fp.read(1000000)
        buff_lines = buff.split('\n')
        time.sleep(0.2)
        for each_line in buff_lines:
            all_lines_list.append(each_line)

# file_list holds the paths of the files to read (defined elsewhere).
while True:
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for each_file in file_list:
            executor.submit(trigger, each_file)
Try using the list extend method, instead of repeating 1 million appends:
all_lines_list.extend(buff_lines)
instead of
for each_line in buff_lines:
    all_lines_list.append(each_line)
If that does not reduce your workload: you are putting your computer to work, reading data 10 times over and storing it in memory, and you need the work done, so why worry that it is taking all the processing power of one core? If you reduce it to 20%, you will get your work done in 5x the time.
You have another problem there: you are opening the files as text files in Python 3 and reading a specific number of characters. That might also use some CPU, as the internals might need to decode each byte to find character boundaries and line separators.
So, if your file is not using a variable-length text encoding like UTF-8, it might be worth opening your files in binary mode and decoding them afterwards (and it might even be worth putting some strategy in place to deal with variable-length characters and reading the files as binary anyway).
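One possible strategy for the binary-mode reading suggested above can be sketched as follows. This is only an illustration under stated assumptions: the function name iter_lines_binary and its parameters are hypothetical, and the file is assumed to be UTF-8 (or another encoding in which the byte 0x0A only ever means a newline), so splitting the raw bytes on newlines never cuts a character in half.

```python
def iter_lines_binary(path, chunk_size=1000000, encoding="utf-8"):
    """Yield decoded lines, reading the file in fixed-size binary chunks.

    Hypothetical helper: decoding happens per complete line, so a
    multi-byte character split across two chunks is never decoded early.
    """
    leftover = b""  # bytes of a line that is not yet complete
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            pieces = (leftover + chunk).split(b"\n")
            leftover = pieces.pop()  # last piece may be an unfinished line
            for raw in pieces:
                yield raw.decode(encoding)
    if leftover:
        yield leftover.decode(encoding)
```

Whether this actually lowers CPU usage depends on the data, so it is worth measuring against the text-mode version before adopting it.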
Of course, you could also gain an advantage by using multiprocessing instead of threading; that way your program would use more than one CPU core to work on the data. However, apart from the proxy lists offered by multiprocessing.Manager, Python does not have a native list object shared between processes. You would need to create your own data structure (and keep it safe with locks) using multiprocessing.Value and multiprocessing.Array objects. Since you don't have much to do with this data other than adding it to the list, I don't think it is worth the effort.
Each thread uses CPU time to do its share of processing. To reduce the CPU utilization, use fewer threads.
I have one very large custom data structure (similar to a trie, though it's not important to the question) that I'm using to access and serve data from. I'm moving my application to uWSGI for production use now, and I definitely don't want this reloaded per worker. Can I share it among worker processes somehow? I just load the structure once and then reload it once a minute through apscheduler. Nothing any of the workers do modifies the data structure in any way. Is there a better solution to this type of problem? Loading the same thing per worker is hugely wasteful.
Depending on the kind of data structure it is, you could try using a memory mapped file. There is a Python library that wraps the relevant system calls.
The file's structure would need to reflect the data structure you are using. For example, if you need a trie, you could store all of the strings in a sorted list and do a binary search for the prefix to see which strings have that prefix.
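The sorted-list idea can be sketched with the standard bisect module. This is a minimal illustration, not the layout of any particular file format: the function name strings_with_prefix is hypothetical, and an in-memory list stands in for whatever sorted structure would be stored in the mapped file.

```python
import bisect

def strings_with_prefix(sorted_strings, prefix):
    """Return all entries of an already sorted list that start with prefix."""
    # Binary search for the first entry that is >= prefix.
    start = bisect.bisect_left(sorted_strings, prefix)
    end = start
    # Entries sharing the prefix are contiguous in sorted order.
    while end < len(sorted_strings) and sorted_strings[end].startswith(prefix):
        end += 1
    return sorted_strings[start:end]
```

The binary search finds the first match in logarithmic time; the linear walk afterwards only touches the matching entries.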
As you access pages in the file, they will be loaded into memory via the OS's disk read cache. Subsequent requests for the same page will be fast. Because the disk cache can be shared between processes, all of your uWSGI workers will benefit from the speed of accessing cached pages.
I tried this on Linux by forcing it to scan a big file in two separate processes. Create a large file called 'big', then run the following in two separate Python processes:
import mmap

with open('big', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, mmap.MAP_PRIVATE)
    # Touch every byte so the whole file is paged in.
    for x in mm:
        if x == b'a':  # Make sure 'a' doesn't occur in the file!
            break
You will notice that the resident memory of the two processes grows as they scan the file, but so does the shared memory usage. For example, if big is a 1 GB file, both processes will appear to be using about 1 GB of memory. However, the overall memory load on the system will increase by only 1 GB, not 2 GB.
Obviously there are some limitations to this approach, chiefly that the data structure you are looking to share must be easily represented in a binary format. Also, Python needs to copy any bytes from the file into memory whenever you access them. This can cause aggressive garbage collection if you frequently read through the entire file in small pieces, or undermine the shared memory benefit of the memory map if you read large pieces.
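The copying mentioned above can be seen in a small sketch. The helper name read_record and the offset/length layout are assumptions made for illustration; only the mmap calls themselves come from the standard library.

```python
import mmap

def read_record(path, offset, length):
    """Return `length` bytes starting at `offset` from a read-only mapping."""
    with open(path, "rb") as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing a mapping builds a new bytes object, i.e. a private
            # copy in this process, separate from the shared cached pages.
            return mm[offset:offset + length]
```

Keeping each slice small keeps the per-process copy small, which is the trade-off described in the paragraph above.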
I believe I am about to ask a definite newbie question, but here goes:
I have written a Python script that does SNMP queries. The SNMP query function uses a global list as its output.
def get_snmp(community_string, mac_ip):
    global output_list
    # snmp get here (produces output_string)
    output_list.append(output_string)
The get_snmp queriers are launched using the following code:
pool.starmap_async(get_snmp, zip(itertools.repeat(COMMUNITY_STRING), input_list))
pool.close()
pool.join()
if output_file_name is not None:
    csv_writer(output_list, output_file_name)
This setup works fine: all of the get_snmp processes write their output to a shared list, output_list, and then the csv_writer function is called and that list is dumped to disk.
The main issue with this program is on a large run the memory usage can become quite high as the list is being built. I would like to write the results to the text file in the background to keep memory usage down, and I'm not sure how to do it. I went with the global list to eliminate file locking issues.
I think that your main problem with increasing memory usage is that you don't remove contents from that list when writing them to file.
Maybe you should do del output_list[:] after writing it to file.
Have each of the workers write their output to a Queue, then have another worker (or the main thread) read from the Queue and write to a file. That way you don't have to store everything in memory.
Don't write directly to the file from the workers; otherwise you can have issues with multiple processes trying to write to the same file at the same time, which will just give you a headache until you fix it anyway.
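The queue-plus-single-writer pattern might look roughly like this. It is a sketch under assumptions: threads and queue.Queue stand in for the worker processes to keep the example self-contained (with a multiprocessing pool, a multiprocessing.Manager().Queue() could play the same role), and query is a hypothetical placeholder for get_snmp.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel that tells the writer to stop

def writer(results, path):
    """Single consumer: drain the queue and write one result per line."""
    with open(path, "w") as out:
        while True:
            item = results.get()
            if item is _DONE:
                break
            out.write(item + "\n")

def query(target, results):
    """Hypothetical stand-in for get_snmp: put the output on the queue."""
    results.put("result for {}".format(target))

def run(targets, path, max_workers=4):
    results = queue.Queue(maxsize=1000)  # bounded, so pending results stay capped
    writer_thread = threading.Thread(target=writer, args=(results, path))
    writer_thread.start()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for target in targets:
            pool.submit(query, target, results)
    # Leaving the with-block waits for all workers, so the sentinel is last.
    results.put(_DONE)
    writer_thread.join()
```

Because only the writer touches the file, no file locking is needed, and results leave memory as soon as they are written.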
I have multiprocessing code wherein each process does a disk write (pickling data), and the resulting pickle files can be upwards of 50 MB (and sometimes even more than 1 GB depending on what I'm doing). Also, different processes are not writing to the same file, each process writes a separate file (or set of files).
Would it be a good idea to implement a lock around disk writes so that only one process is writing to the disk at a time? Or would it be best to just let the operating system sort it out even if that means 4 processes may be trying to write 1 GB to the disk at the same time?
As long as the processes aren't fighting over the same file, let the OS sort it out. That's its job.
Unless your processes try to dump their data in one big write, the OS is in a better position to schedule disk writes.
If you do use one big write, you might try to partition it into smaller chunks. That might give the OS a better chance of handling them.
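Partitioning one big write into smaller chunks could be sketched like this. The function name dump_in_chunks and the 64 MB default are illustrative assumptions, not a recommendation from the answer; note that the whole pickle is still built in memory first.

```python
import pickle

def dump_in_chunks(obj, path, chunk_size=64 * 1024 * 1024):
    """Pickle obj, then write the bytes to path in chunk_size pieces."""
    # memoryview lets us slice the pickled bytes without copying them.
    data = memoryview(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    with open(path, "wb") as fp:
        for start in range(0, len(data), chunk_size):
            fp.write(data[start:start + chunk_size])
```

The resulting file is an ordinary pickle and can be read back with pickle.load.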
Of course you will hit a limit somewhere. Your program might be CPU-bound, memory-bound or disk-bound. It might hit different limits depending on the input or load.
But unless you've got evidence that you're constantly disk-bound and you've got a good idea how to solve that, I'd say don't bother. The days when a write system call actually meant that the data was sent directly to disk are long gone.
Most operating systems these days use unallocated RAM as a disk cache, and HDDs have built-in caches as well. Unless you disable both of these (which will give you a huge performance hit), there is precious little connection between your program completing a write and the data actually hitting the platters or flash.
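To make the gap between "write returned" and "data on the device" concrete, here is a small sketch of what a program has to do to ask for the data to be pushed out. The helper name write_durably is hypothetical; even after os.fsync returns, a drive's own cache may still hold the data.

```python
import os

def write_durably(path, data):
    """Write bytes and ask the OS to flush them to the storage device."""
    with open(path, "wb") as fp:
        fp.write(data)         # lands in Python's buffer or the OS cache
        fp.flush()             # hand Python's buffer over to the OS
        os.fsync(fp.fileno())  # ask the OS to write its cached pages out
```

Doing this on every write gives up most of the benefit of the caches described above, which is why it is normally reserved for data that must survive a crash.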
You might consider using memory mapping (mmap), if your OS supports it, and let the OS's virtual memory do the work for you. See e.g. the architect's notes for the Varnish cache.
I'm trying to debug a memory leak. It's a script that runs as a daemon, and has dependencies in about 10K lines of code in 30 different files. After a while you see the memory usage start creeping up.
I used heapy to determine that it's a dict that's growing, but how can I find out which file that dict lives in?