I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently, it would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googleing, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/process will probably just slow you down as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks,, and processing is slowing you down, then you'll need to go with multiprocessing.
Multiprocessing is easier to understand/use than multithreading(IMO). For my reasons, I suggest reading this section of TAOUP. Basically, everything a thread does, a process does, only the programmer has to do everything that the OS would handle. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes. 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.
Related
I need to get several processes to communicate with each others on a macOS system. These processes will be spawned several times per day at different times, and I cannot predict when they will be up at the same time (if ever). These programs are in python or swift.
How can I safely allow these programs to all write to the same file?
I have explored a few different options:
I thought of using sqlite3, however I couldn't find an answer in the documentation on whether it was safe to write concurrently across processes. This question is not very definitive, old, and I would ideally like to get a more authoritative answer.
I thought of using multiprocesing as it supports locks. However, as far as I could see in the documentation, you need a meta-process that spawns the children and stays up for the duration of the longest child process. I am am fine having a meta-spawner process, but it feels wasteful to have a meta-process basically staying up all day long, just to resolve conflicting access ?
Along the lines of this, I thought of having a process that stays up all day long, and receive messages from all other processes, and is the sole responsible for writing to file. It feels a little wasteful, how worried should I be about the resource cost of having a program up all day, and doing little? Are the only thing to worry about memory footprint and CPU usage (as shown in activity monitor), or could there be other significant costs, e.g. context switching?
I have come across flock on linux, that seems to be a locking mechanism to access files, provided by the OS. This seems like a good solution, but this does not seem to exist on macOS?
Any idea to solve this requirement in a robust manner (so that I don't have to debug every other day - I now concurrency can be a pain), is most welcome!
While you are in control of the source code of all such processes, you could use flock. It will put the advisory lock on file, so the other writer will be blocked only in case he is also access the file the same way. This is OK for you, if only your processes will ever need to write to the shared file.
I've tested flock on BigSur, it is still implemented and works fine.
You can also do it in any other common manner: create temporary .lock file in the known location(this is what git does), and remove it after current writer is done with the main file; use semaphores; etc
I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original ask.
Instead of loading a large file into memory and sharing between independent processes, I found that the bottleneck was really the parsing function in pandas library.
Particularly, CSV parsing, as CSVs are notoriously inefficient in terms of data storage.
I started storing my files in the python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically from ~30s to ~2s.
I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably consumes only little time but it does
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you have your feeds downloaded on a regularly basis (you could set up a cron job or write a Python daemon) and stored somewhere on your disk (i.e. a plain text file) so you just need to display them at your terminal's startup (echo would probably be the easiest and fastest).
I personally made good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
Parse at real time not better case if you want faster result.
You can try does it asynchronously by Celery or by similar other solutions. I like the Celery, it gives many abilities. There are abilities as task as the cron or async and more.
I was given a task to process a large collection of audiofiles. Each file must be processed in four steps:
convertion from .wav into raw pcm,
resampling,
quantization
coding with one of three speech codecs.
Each step corresponds to a program taking a file as input and returning a file as output. Processing file by file seems to take long. How can I optimize the procedure? E.g. parrallel programming or something? I tried to make use of ramdisk to reduce the time spent to file reading/writing but it didn't give improvement. (Why?)
I'm writing in Python under Ubuntu Linux. Thanks in advance.
Reading and writing to disk is pretty slow. If each program result is being written to disk then it would be better to stop that from happening. Sockets seem like a good fit to me. Read more here: http://docs.python.org/library/ipc.html
Parallel program is nice... need more info before I can say much more on this topic. I remember reading a while ago about python not handling threading so efficiently, so that might not be the best bet. As far as I recall it just emulates parallel processing by switching between tasks really gosh darn quickly. so that wont help. This may have changed since I've worked with threading.... Extra processes on the other hand sound like a good idea.
If you need a less-vague answer please supply specifics in your question.
EDIT
The thing i read a while ago about threads looks like this: http://docs.python.org/2/glossary.html#term-global-interpreter-lock
I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then get it back on completion and by doing so make use of multiple cores if they're available.
I'm using multiprocess to parallelize some code and currently I've got three processes, the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child process.
Now if I'm understanding this right, it seems that creating a new process to read the file as I'm currently doing is uneccessary and I should just call it in the main process. The question is am I understanding this right and will I get better performance with the read kept in the main process or in a separate one?
So given my function to read and pipe the data to be processed:
def read(file_path, pipe_out):
with open(file_path, 'rb') as file_:
while True:
block = file_.read(block_size)
if not block:
break
pipe_out.send(block)
pipe_out.close()
I reckon that this will definitely make use of multiple cores, but also introduces some overhead:
multiprocess.Process(target=read, args).start()
But now I'm wondering if just doing this will also use multiple cores, minus the overhead:
read(*args)
Any insights anybody has as to which one would be faster and for what reason would be much appreciated!
Okay, as came out by the comments, the actual question is:
Does (C)Python create threads on its own, and if so, how can I make use of that?
Short answer: No.
But, the reason why these C-Functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of python code running in the same interpreter can execute in parallel, this is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the above statement, that no two pieces of python code can run in parallel in the same interpreter.
Nevertheless, you can still make use of multithreading in python, namely when you're doing a lot of I/O or make a lot of use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the python interpreter).
I hope that cleared up the issue a bit.
I think this is the main part of your question:
The question is am I understanding this right and will I get better
performance with the read kept in the main process or in a separate
one?
I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound and not CPU bound. You cannot process data faster than you are able to read it. So file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading. Also 'low level' CPython is not doing this. As long as you read the file in one process or thread (even in case of CPython with its GIL a thread is fine), you will get as much data per time as you can get from the storage device. It is also fine if you do the file reading in the main thread as long as there are no other blocking calls that would actually slow down the file reading.