Reducing CPU Utilization - python

I am using Python 3.3 to read a directory that has 10 files of 20 MB each. I am using a thread pool executor with a maximum of 10 threads and submitting the files to be read. I read a chunk of 1 MB at a time and then store each line from all the files in a thread-safe list. When I look at the top command, the CPU utilization is pretty high, approximately 100% or above. Any suggestion to reduce the CPU utilization? Below is the snippet.
import concurrent.futures
import time

all_lines_list = []

def trigger(filename):
    with open(filename, "r") as fp:
        buff = fp.read(1000000)
        buff_lines = buff.split('\n')
        time.sleep(0.2)
        for each_line in buff_lines:
            all_lines_list.append(each_line)

while True:
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        for each_file in file_list:
            executor.submit(trigger, each_file)

Try using the list extend method, instead of repeating 1 million appends:
all_lines_list.extend(buff_lines)
instead of
for each_line in buff_lines:
    all_lines_list.append(each_line)
If that does not reduce your workload enough: you are putting your computer to work - reading 10 files at once and storing the data in memory - and you need that work done, so why worry that it is taking all the processing of one core? If you throttle it to 20%, you will just get your work done in 5x the time.
You have another problem there: you are opening the files as text files in Python 3 and reading a specific number of characters - that may well use some CPU, since the internals may need to decode each byte to find character boundaries and line separators.
So, if your files do not use a variable-length text encoding like UTF-8, it might be worth opening them in binary mode and decoding afterwards (and it might even be worth putting some strategy in place to deal with variable-length characters so the reading can stay binary anyway).
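For illustration, here is a minimal sketch of the binary-read variant, assuming the files are UTF-8; trigger_binary and chunk_size are placeholder names, and a real version would still have to handle a chunk that ends in the middle of a line or of a multi-byte character:
def trigger_binary(filename, chunk_size=1000000):
    # read raw bytes so the text layer does not have to decode byte by byte
    with open(filename, "rb") as fp:
        buff = fp.read(chunk_size)
    # decode once per chunk and split into lines; errors="replace" papers over
    # a chunk that ends mid-character in this sketch
    return buff.decode("utf-8", errors="replace").split("\n")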
Of course, you could also gain something by using multiprocessing instead of threading - that way your program would use more than one CPU core to work on the data. However, Python does not have a native multiprocess shared list object - you would need to build your own data structure (and keep it safe with locks) using multiprocessing.Value and multiprocessing.Array objects. Since you don't do much with this data beyond adding it to the list, I don't think it is worth the effort.

Each thread uses CPU time to do its share of processing. To reduce the CPU utilization, use fewer threads.
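As a rough sketch, reusing file_list and trigger from the question, the only change is the max_workers value (the trade-off is the one noted above: less CPU at any moment, more wall-clock time):
import concurrent.futures

# same layout as the question, just with a smaller pool
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    for each_file in file_list:
        executor.submit(trigger, each_file)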


How to know when the disk is ready after calling flush() on the stream?

I have a file called foo.txt with 1B rows.
I want to apply an operation to each row that produces 10 new rows, so the output is expected to be around 10B rows.
To improve speed and IO, foo.txt is on DiskA and bar.txt on DiskB (physically different drives).
DiskB is going to be the limiting factor. Because there are a lot of rows to write, I use a large buffer when writing to DiskB.
My question is: when I call flush() on the DiskB file, the file handle's buffer is flushed to the hard drive. It seems to be a non-blocking call, since the command returns but I can still see that the disk is writing and its busy indicator is at 100%. After a couple of seconds, the indicator goes back to 0%. Is there a way in Python to wait for the disk to finish? Ideally, I would like flush() to be a blocking call. The only solution I see now is to add arbitrary sleep() calls and hope that the disk is ready.
Here's a snippet to show it visually (it's a bit more complicated in practice, as bar.txt is not just one file but thousands of files, so the IO efficiency is very poor):
import io

with open('bar.txt', 'w', buffering=100 * io.DEFAULT_BUFFER_SIZE) as w:
    with open('foo.txt') as r:
        for line in r:
            # writes each line of foo 10 times in bar
            for i in range(10):
                w.write(line)
            # w.flush()
I think there are several issues.
"when the disk is ready" had to be defined very clearly.
your OS, file system and configuration might be important
your exact use case might be important
Knowing when data has been written to the disk has different answers depending on the OS, the file system and the OS/filesystem configuration.
Please note that the following situations are not (or might not be) identical:
knowing when your process can write the next bytes without blocking
knowing when another process can read the bytes from the file (might be after a flush)
knowing when the OS has handed the last byte to the disk controller (when all OS write caches are flushed)
knowing when you could cut the power without losing data / when the data has really been written to the disk (when the disk controller has flushed its own buffers)
(the short sketch after this list illustrates the difference between the second and third points)
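As a concrete illustration (a sketch only, reusing bar.txt from the question): flush() pushes Python's buffer to the OS, while os.fsync() additionally asks the OS to push its caches out to the device and blocks until that has happened.
import os

with open('bar.txt', 'w') as w:
    w.write('some data\n')
    w.flush()              # Python buffer -> OS page cache; another process can read it now
    os.fsync(w.fileno())   # ask the OS to flush its caches to the device; blocks until done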
The main question really is: why exactly do you have to know when the last byte has been written?
If your motivation is performance, then perhaps it's just good enough to have two threads:
a reading and processing thread that places the data to write into a queue (queue.Queue) with a maximum number of entries. That means that when the queue reaches a certain size, the reading/processing thread blocks
a writing thread that just reads from the queue and writes to disk.
If that is the case and you have never used threading and a thread-safe queue, just tell me and I can enhance my answer.
However, if you say that writing / flushing is not / never blocking, then this wouldn't help.
Just for fun, you could implement the threads above and have a third thread periodically check the size of the queue to see whether writing is really the bottleneck. If it is, your FIFO should be (almost) full most of the time.
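A minimal sketch of that two-thread layout, reusing foo.txt and bar.txt from the question (the "10 new rows per input row" step is only simulated by writing each line 10 times, and the queue size is an arbitrary placeholder):
import io
import queue
import threading

SENTINEL = None                    # marks the end of the input
q = queue.Queue(maxsize=10000)     # bounded: the reader blocks if the writer falls behind

def reader(in_path):
    # read each input row and produce the derived rows
    with open(in_path) as r:
        for line in r:
            for _ in range(10):
                q.put(line)        # blocks while the queue is full
    q.put(SENTINEL)

def writer(out_path):
    # drain the queue and write to the output drive
    with open(out_path, 'w', buffering=100 * io.DEFAULT_BUFFER_SIZE) as w:
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            w.write(item)

# a third thread could sample q.qsize() periodically to see whether writing is the bottleneck
t_read = threading.Thread(target=reader, args=('foo.txt',))
t_write = threading.Thread(target=writer, args=('bar.txt',))
t_read.start(); t_write.start()
t_read.join(); t_write.join()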
Comment after first feedback:
You're running on Linux, writing to an SSD with an ext4 file system.
It seems (though I'm still not sure) that a more representative example than the one in the question would be a script that just writes to N files in an alternating manner at different data rates.
I still have the impression that increased write buffer sizes, letting the OS do the rest, should give you a performance that is difficult to improve on with manual intervention.
Disabling journaling on the disk might improve performance, though.
import io

writers = []
writers.append((open("f1", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "a"))
writers.append((open("f2", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "b" * 10000))
writers.append((open("f3", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "c" * 100))
...
writers.append((open("f1000", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "a" * 200))

for n in range(x):
    # Normally this is where you would read data from a file,
    # analyse the data and write some data to one or multiple writers.
    # As a very approximate simulation I just write to the writers
    # in data chunks in alternating order.
    for writer, data in writers:
        writer.write(data)
        # this is the question:
        # Can I write lines of the following nature, that will increase
        # the write rate?
        if some_condition:
            writer.flush()
Would this model your problem? (I know that in reality the write rates of the writers won't be constant and the order in which the writers write would be random.)
I have the impression I am missing something.
Why should these flushes accelerate anything?
This is an SSD. It doesn't have any mechanical delays from waiting for the disk to spin to a certain position. The buffering will only write to a file when there is 'enough data worth being written'.
What also confuses me is that you say flush() is non-blocking.
A buffered write just puts data into a buffer and calls flush() when the buffer is full; this means that write() would also be non-blocking.
If everything were non-blocking, then your process would lose no time on writing and there would be nothing to optimize.
So I guess that write() and flush() are blocking calls, but not blocking in the way you expect them to block. They probably block until the OS has accepted the data for writing (which does not mean the data has been written).
The real writes to the disk will happen whenever the OS decides to do so.
There are write caches involved, the disk controller might add some other layers of write caching / write reordering.
In order to check this, you can add debug code of the following kind around every write:
import time

t_max = 0
...
# this has to be done for every `write` or `flush`,
# or at least for some representative calls of them
t0 = time.time()
bla.write()
t = time.time() - t0
t_max = max(t_max, t)
You will probably see a t_max that indicates that .write is blocking.

Read a file using threads

I am trying to write a Python program that sends files from one PC to another using Python's sockets. But when the file size increases, it takes a lot of time. Is it possible to read lines of a file sequentially using multiple threads?
The concept I have in mind is as follows:
Each thread separately and sequentially reads lines from the file and sends them over the socket. Is it possible to do this? Or do you have any other suggestion?
First, if you want to speed this up as much as possible without using threads, reading and sending a line at a time can be pretty slow. Python does a great job of buffering up the file to give you a line at a time for reading, but then you're sending tiny 72-byte packets over the network. You want to try to send at least 1.5KB at a time when possible.
Ideally, you want to use the sendfile method. Python will tell the OS to send the whole file over the socket in whatever way is most efficient, without getting your code involved at all. Unfortunately, this doesn't work on Windows; if you care about that, you may want to drop to the native APIs [1] directly with pywin32 or switch to a higher-level networking library like Twisted or asyncio.
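For the simple whole-file case, a minimal sketch of socket.sendfile (Python 3.5+) looks like this; the host, port and file name are placeholders, not values from the question:
import socket

with socket.create_connection(('192.0.2.10', 9000)) as sock, \
        open('payload.bin', 'rb') as f:
    # the OS pushes the file itself; Python falls back to plain send() where
    # os.sendfile is not available
    sock.sendfile(f)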
Now, what about threading?
Well, reading a line at a time in different threads is not going to help very much. The threads have to read sequentially, fighting over the read pointer (and buffer) in the file object; they presumably have to write to the socket sequentially; and you probably even need a mutex to make sure they write things in order. So, whichever one of those is slowest, all of your threads are going to end up waiting for their turn. [2]
Also, even forgetting about the sockets: Reading a file in parallel can be faster in some situations on modern hardware, but in general it's actually a lot slower. Imagine the file is on a slow magnetic hard drive. One thread is trying to read the first chunk, the next thread is trying to read the 64th chunk, the next thread is trying to read the 4th chunk… this means you spend more time seeking the disk head back and forth than actually reading data.
But, if you think you might be in one of those situations where parallel reads might help, you can try it. It's not trivial, but it's not that hard.
First, you want to do binary reads of fixed-size chunks. You're going to need to experiment with different sizes—maybe 4KB is fastest, maybe 1MB… so make sure to make it a constant you can easily change in just one place in the code.
Next, you want to be able to send the data as soon as you can get it, rather than serializing. This means you have to send some kind of identifier, like the offset into the file, before each chunk.
The function will look something like this:
import struct

def sendchunk(sock, lock, file, offset):
    with lock:
        sock.send(struct.pack('>Q', offset))
        sent = sock.sendfile(file, offset, CHUNK_SIZE)
        if sent < CHUNK_SIZE:
            raise OopsError(f'Only sent {sent} out of {CHUNK_SIZE} bytes')
… except that (unless your files actually are all multiples of CHUNK_SIZE) you need to decide what you want to do for a legitimate EOF. Maybe send the total file size before any of the chunks, and pad the last chunk with null bytes, and have the receiver truncate the last chunk.
The receiving side can then just loop reading 8+CHUNK_SIZE bytes, unpacking the offset, seeking, and writing the bytes.
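A rough sketch of that receiving loop (the CHUNK_SIZE value and the helper names are placeholders; handling the padded final chunk from the previous paragraph is left out):
import struct

CHUNK_SIZE = 1024 * 1024   # must match the sender's constant

def recv_exact(sock, n):
    # read exactly n bytes, or return b'' if the peer closed the connection
    buf = b''
    while len(buf) < n:
        data = sock.recv(n - len(buf))
        if not data:
            return b''
        buf += data
    return buf

def receive_chunks(sock, outfile):
    # loop: an 8-byte big-endian offset header, then one fixed-size chunk
    while True:
        header = recv_exact(sock, 8)
        if not header:
            break
        (offset,) = struct.unpack('>Q', header)
        chunk = recv_exact(sock, CHUNK_SIZE)
        outfile.seek(offset)
        outfile.write(chunk)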
1. See TransmitFile—but in order to use that, you have to know how to go between Python-level socket objects and Win32-level HANDLEs, and so on; if you've never done that, there's a learning curve—and I don't know of a good tutorial to get you started.
2. If you're really lucky, and, say, the file reads are only twice as fast as the socket writes, you might actually get a 33% speedup from pipelining—that is, only one thread can be writing at a time, but the threads waiting to write have mostly already done their reading, so at least you don't need to wait there.
Not Threads.
import shutil

source_path = r"\\mynetworkshare"
dest_path = r"C:\TEMP"
file_name = "\\myfile.txt"

shutil.copyfile(source_path + file_name, dest_path + file_name)
https://docs.python.org/3/library/shutil.html
Shutil offers a high level copy function that uses the OS layer to copy. It is your best bet for this scenario.

Preventing multiple loading of large objects in uWSGI workers?

I have one very large custom data structure (similar to a trie, though that's not important to the question) that I'm using to access and serve data. I'm moving my application to uWSGI for production use now, and I definitely don't want this reloaded per worker. Can I share it among worker processes somehow? I just load the structure once and then reload it once a minute through apscheduler. Nothing the workers do modifies the data structure in any way. Is there another, better solution to this type of problem? Loading the same thing per worker is hugely wasteful.
Depending on the kind of data structure it is, you could try using a memory-mapped file. The standard-library mmap module wraps the relevant system calls.
The file's structure would need to reflect the data structure you are using. For example, if you need a trie, you could store all of the strings in a sorted list and do a binary search for the prefix to see which strings have that prefix.
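As a rough, in-memory illustration of that sorted-list prefix lookup (the real version would search inside the mapped file rather than a Python list; the names here are made up):
import bisect

words = ["apple", "applet", "apply", "banana", "band", "bandit"]  # must be sorted

def with_prefix(sorted_words, prefix):
    # binary search for the first candidate, then take the contiguous run of matches
    lo = bisect.bisect_left(sorted_words, prefix)
    out = []
    for w in sorted_words[lo:]:
        if not w.startswith(prefix):
            break
        out.append(w)
    return out

print(with_prefix(words, "app"))   # ['apple', 'applet', 'apply']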
As you access pages in the file, they will be loaded into memory via the OS's disk read cache. Subsequent requests for the same page will be fast. Because the disk cache can be shared between processes, all of your UWSGI workers will benefit from the speed of accessing cached pages.
I tried this on Linux by forcing two separate processes to scan a big file. Create a large file called 'big', then run the following in two separate Python processes:
import mmap

with open('big', 'rb') as fp:
    mm = mmap.mmap(fp.fileno(), 0, mmap.MAP_PRIVATE)
    for x in mm:                  # scan byte by byte to touch every page
        if x == b'a':             # make sure 'a' doesn't occur in the file!
            break
You will notice that the resident memory of the two processes grows as they scan the file; however, so does their shared memory usage. For example, if 'big' is a 1 GB file, both processes will appear to be using about 1 GB of memory. However, the overall memory load on the system will be increased by only 1 GB, not 2 GB.
Obviously there are some limitations to this approach, chiefly that the data structure you are looking to share is easily represented in a binary format. Also, Python needs to copy any bytes from the file into memory whenever you access them. This can cause aggressive garbage collection if you frequently read through the entire file in small pieces, or undermine the shared memory benefit of the memory map if you read large pieces.

Python Multiprocessing: is locking appropriate for (large) disk writes?

I have multiprocessing code wherein each process does a disk write (pickling data), and the resulting pickle files can be upwards of 50 MB (and sometimes even more than 1 GB depending on what I'm doing). Also, different processes are not writing to the same file, each process writes a separate file (or set of files).
Would it be a good idea to implement a lock around disk writes so that only one process is writing to the disk at a time? Or would it be best to just let the operating system sort it out even if that means 4 processes may be trying to write 1 GB to the disk at the same time?
As long as the processes aren't fighting over the same file, let the OS sort it out. That's its job.
Unless your processes try to dump their data in one big write, the OS is in a better position to schedule disk writes.
If you do use one big write, you might try to partition it into smaller chunks. That might give the OS a better chance of handling them.
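A minimal sketch of that partitioning idea (the chunk size is an arbitrary placeholder to tune for your workload):
CHUNK = 8 * 1024 * 1024  # 8 MiB, an assumed value

def write_in_chunks(path, payload):
    # write one large bytes object in CHUNK-sized pieces instead of a single huge write()
    with open(path, 'wb') as f:
        for start in range(0, len(payload), CHUNK):
            f.write(payload[start:start + CHUNK])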
Of course you will hit a limit somewhere. Your program might be CPU-bound, memory-bound or disk-bound, and it might hit different limits depending on the input or load.
But unless you have evidence that you're constantly disk-bound and you have a good idea how to solve that, I'd say don't bother. The days when a write system call actually meant that the data was sent directly to the disk are long gone.
Most operating systems these days use unallocated RAM as a disk cache, and HDDs have built-in caches as well. Unless you disable both of these (which will give you a huge performance hit), there is precious little connection between your program completing a write and the data actually hitting the platters or flash.
You might consider using mmap (if your OS supports it) and letting the OS's virtual memory do the work for you. See, for example, the architect notes for the Varnish cache.

Optimize Memory Usage in Python: del obj or gc.collect()?

I have a Python script to analyze user behavior from a log file.
This script reads from several large files (about 50 GB each) using file.readlines(), analyzes them line by line, and saves the results in a dict of Python objects; after all lines are analyzed, the dict is written to disk.
As I have a server with 64 cores and 96 GB of memory, I start 10 processes of this script, each of which handles part of the data. Besides, in order to save time spent on IO operations, I use file.readlines(MAX_READ_LIMIT) instead of file.readline() and set MAX_READ_LIMIT = 1 GB.
After running this script on the server while using the top command to show task resources, I find that although each process of my script occupies only about 3.5 GB of memory (40 GB in total), there is only 380 MB left on the server (and no other significant memory-consuming app is running on the server at the same time).
So I was wondering: where is the memory? There should be about 96 - 40 = 56 GB of memory left.
Please tell me if I have made some mistake in the above observations.
One hypothesis is that the unused memory is NOT returned to the memory pool immediately, so I was wondering how to release unused memory explicitly and immediately.
I learned from the Python documentation that there are two complementary methods to manage memory in Python: garbage collection and reference counting, and according to the Python docs:
"Since the collector supplements the reference counting already used in Python, you can disable the collector if you are sure your program does not create reference cycles."
So, which one should I use for my case, del obj or gc.collect() ?
using file.readlines() , then analyze data line by line
This is a bad design. readlines reads the entire file and returns a Python list of strings. If you only need to process the data line-by-line, then iterate through the file without using readlines:
with open(filename) as f:
    for line in f:
        pass  # process line
This will massively reduce the amount of memory your program requires.
