Optimization of successive processing of a large audio collection with several programs - Python

I was given a task to process a large collection of audio files. Each file must be processed in four steps:
conversion from .wav into raw PCM,
resampling,
quantization,
coding with one of three speech codecs.
Each step corresponds to a program that takes a file as input and returns a file as output. Processing the files one by one takes a long time. How can I optimize the procedure, e.g. with parallel programming or something similar? I tried to make use of a ramdisk to reduce the time spent on file reading/writing, but it didn't give any improvement. (Why?)
I'm writing in Python under Ubuntu Linux. Thanks in advance.

Reading and writing to disk is pretty slow. If each program's result is being written to disk, then it would be better to stop that from happening. Sockets (or pipes) seem like a good fit to me. Read more here: http://docs.python.org/library/ipc.html
Parallel programming is nice too, but I need more info before I can say much more on this topic. I remember reading a while ago that Python doesn't handle threading very efficiently; as far as I recall, it just emulates parallel processing by switching between tasks really gosh darn quickly, so that won't help. This may have changed since I've worked with threading. Extra processes, on the other hand, sound like a good idea (a rough sketch follows the edit below).
If you need a less-vague answer please supply specifics in your question.
EDIT
The thing I read a while ago about threads is the global interpreter lock: http://docs.python.org/2/glossary.html#term-global-interpreter-lock
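For illustration, here is a rough sketch combining both ideas: each file is handled by a worker process from a multiprocessing pool, and inside a worker the four external programs are chained through pipes so the intermediate data never touches the disk. The tool names (wav2pcm, resample, quantize, speech_codec) and the audio directory are placeholders for your actual commands and paths.

import os
import subprocess
from multiprocessing import Pool

def process_file(wav_path):
    out_path = wav_path + '.coded'
    # Chain the four steps via stdin/stdout instead of temporary files.
    p1 = subprocess.Popen(['wav2pcm', wav_path, '-'], stdout=subprocess.PIPE)
    p2 = subprocess.Popen(['resample', '-', '-'], stdin=p1.stdout, stdout=subprocess.PIPE)
    p3 = subprocess.Popen(['quantize', '-', '-'], stdin=p2.stdout, stdout=subprocess.PIPE)
    p4 = subprocess.Popen(['speech_codec', '-', out_path], stdin=p3.stdout)
    # Close our copies of the intermediate pipes so SIGPIPE propagates correctly.
    p1.stdout.close()
    p2.stdout.close()
    p3.stdout.close()
    p4.wait()
    return out_path

if __name__ == '__main__':
    wav_files = [os.path.join('audio', f) for f in os.listdir('audio') if f.endswith('.wav')]
    with Pool() as pool:  # one worker process per CPU core by default
        for result in pool.imap_unordered(process_file, wav_files):
            print('done:', result)

With this layout only the final output is ever written to disk, which is also why a ramdisk alone gives little benefit if the intermediate files are still being created and re-read at every step.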

Related

Best Method for Importing Very Large CSV into Stata and/or Other Statistical Software

I am trying to import a csv file containing 15 columns and totaling 64 GB. I am using Stata MP on a computer with 512 GB of RAM, which I would presume is sufficient for a task such as this. Yet I started the import using an import delimited command over 3 days ago, and it still has not loaded into Stata (it still shows importing). Has anyone else run into an issue like this and, if so, is this the type of situation where I just need to wait longer and it will eventually import, or will I end up waiting forever?
Would anyone have recommendations on how best to tackle situations such as this? I've heard that SAS tends to be much more efficient for big-data tasks since it doesn't need to read an entire dataset into memory at once, but I have no coding knowledge in SAS and am not even sure I'd have a way to access it. I do have an understanding of Python and R, but I am unsure whether either would be of benefit, since I believe they also read an entire dataset into memory.
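For what it's worth, Python does not have to read the whole dataset into memory: pandas can stream a CSV in fixed-size chunks. A minimal sketch, where the file name and chunk size are arbitrary placeholders:

import pandas as pd

total_rows = 0
for chunk in pd.read_csv('big_file.csv', chunksize=1_000_000):
    # aggregate, filter, or write out a reduced file chunk by chunk
    total_rows += len(chunk)
print('rows processed:', total_rows)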

ffmpeg in python - extracting meta data

I use ffmpeg-python to extract metadata from video files. I think the official documentation is available here: https://kkroening.github.io/ffmpeg-python/
To extract metadata (duration, resolution, frames per second, etc.) I use the provided function ffmpeg.probe. Sadly, when running it on a large number of video files, it is rather inefficient, as it seems to (obviously?) load the whole file into memory each time just to read a small amount of data.
If this is not what it does, maybe someone could explain what the cause of the rather long runtime might be.
Otherwise, is there any way to retrieve metadata more efficiently using ffmpeg or some other library?
Any feedback or help is very much appreciated.
Edit: For clarity I added my code here:
import os
import multiprocessing as mp
import ffmpeg

pool = mp.Pool()
videos = []
for file in os.listdir(directory):
    pool.apply_async(ffmpeg.probe, args=[os.path.join(directory, file)], callback=videos.append)
pool.close()
pool.join()
The definition of the directory path is missing, but this should suffice to understand what is going on.
running it on a large amount of video files
This is where multithreading/multiprocessing could potentially be helpful, IF the slowdown comes from subprocess spawning (and not from actual I/O). It may not help, as file I/O in general takes time compared to virtually everything else.
load the whole file into memory each time to read just a small amount of data
This is an incorrect assertion, IMO. ffprobe should only read the relevant headers/packets to retrieve the metadata. You are likely paying the subprocess tax more than anything else.
a way to retrieve meta data
(1) Adding to what @Peter Hassaballeh said above, ffprobe has options to limit what it looks up. If you only need the container (format)-level info, or only that of a particular stream, you can specify exactly what you need (to an extent); see the sketch after this list. This could save some time.
(2) You can try MediaInfo (another free tool like ffprobe), which you should be able to call from Python as well.
(3) If you are dealing with a particular file format, the fastest way is to decode it yourself in Python and read only the bytes that matter to you. Depending on what the current bottleneck is, it may not be that drastic of an improvement, though.
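To illustrate option (1), here is a rough sketch that calls ffprobe directly and requests only a few fields; the file name is a placeholder, and the flags shown are standard ffprobe options.

import json
import subprocess

def probe_minimal(path):
    # Ask ffprobe only for a handful of fields instead of the full probe output.
    cmd = [
        'ffprobe', '-v', 'error',
        '-select_streams', 'v:0',
        '-show_entries', 'format=duration:stream=width,height,r_frame_rate',
        '-of', 'json',
        path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)

print(probe_minimal('example.mp4'))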
I suggest using ffprobe directly. Unfortunately ffmpeg can be CPU-expensive sometimes, but it all depends on your hardware specs.

Python generator compute and store in background

I have a Python generator which goes through a list of files and processes the data inside, one by one. The order is important, as I need the result from the previous file to calculate the next file, so it's not an embarrassingly parallel task. When each file is processed I spit the data out of the generator to start the main calculation. I'm wondering if it's possible to keep the generator running in the background and 'cache' the results, although I don't have much experience with this topic.
My code looks something like this -
for processedData in myGenerator():
    bigCalculation(processedData)
and I'm looking for something like this -
for processedData in cleverParallelFunction(myGenerator()):
    bigCalculation(processedData)
As a note, processedData is of reasonable size (a few GB) and the processing time is on the order of the file-reading time. I'm curious whether multiprocessing is even helpful here, as sending the data down a pipe might take quite some time as well, but I'm not sure!
Any help would be really appreciated here!
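One common pattern that matches the cleverParallelFunction idea is a prefetching wrapper: a background thread keeps the generator running and buffers a couple of results ahead while the main loop computes. This is only a sketch (the buffer size and the sentinel object are arbitrary choices); since reading files releases the GIL, a thread is usually enough here and avoids the cost of pickling multi-GB objects through a multiprocessing pipe.

import threading
import queue

def prefetch(gen, buffer_size=2):
    # A background thread drives the generator and buffers up to buffer_size results.
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def worker():
        for item in gen:
            q.put(item)          # blocks when the buffer is full
        q.put(sentinel)          # signal that the generator is exhausted

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# usage, mirroring the snippet above:
# for processedData in prefetch(myGenerator()):
#     bigCalculation(processedData)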

Speed up feedparser

I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
import feedparser as fp

x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this time delay primarily from communications through the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably consumes only a little time, but it still contributes
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you have your feeds downloaded on a regular basis (you could set up a cron job or write a Python daemon) and stored somewhere on your disk (e.g. a plain text file), so you just need to display them at your terminal's startup (echo would probably be the easiest and fastest); a rough sketch is below.
I personally have had good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
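A rough sketch of the cron-job idea, assuming a placeholder cache path: fetch the feed once, write the top titles to a plain text file, and let the shell startup simply cat that file.

import feedparser

URL = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
CACHE = '/tmp/google_news_titles.txt'   # placeholder location for the cached titles

feed = feedparser.parse(URL)
with open(CACHE, 'w') as f:
    for entry in feed.entries[:5]:
        f.write(entry.title + '\n')

Then the shell startup only has to run cat /tmp/google_news_titles.txt, which is effectively instant.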
Parsing in real time is not the best approach if you want faster results.
You can try doing it asynchronously with Celery or a similar solution. I like Celery; it gives you many capabilities, such as scheduled (cron-like) tasks, asynchronous tasks, and more.

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently; that would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single-process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords).
I have done various experiments using the multiprocessing API, condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but (a) I'm not sure if this is the appropriate route to go, and (b) my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading from a single disk and disk I/O is slowing you down, multiple threads/processes will probably just slow you down further, as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading from a single disk and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek-time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is spent away from the processor, the GIL won't be in your way.
If you're reading from multiple disks and processing is slowing you down, then you'll need to go with multiprocessing.
Multiprocessing is easier to understand/use than multithreading (IMO). For the reasons why, I suggest reading this section of TAOUP. Basically, everything a thread does, a process can do too; the difference is that with threads the programmer has to handle everything that the OS would otherwise handle for processes. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes: one to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together (a rough sketch follows).
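A rough sketch of that layout, with placeholder paths and keywords, and a plain substring test standing in for the Boyer-Moore-Horspool scan from the linked script:

import glob
import os
from multiprocessing import Process, Queue, cpu_count

def reader(paths, work_q, n_workers):
    # A single reader keeps disk access sequential and avoids seek thrash.
    for path in paths:
        with open(path, 'rb') as f:
            work_q.put((path, f.read()))
    for _ in range(n_workers):
        work_q.put(None)                    # one stop signal per worker

def worker(work_q, result_q, keywords):
    while True:
        item = work_q.get()
        if item is None:
            break
        path, data = item
        for kw in keywords:
            if kw in data:                  # stand-in for the Boyer-Moore-Horspool scan
                result_q.put((path, kw))
    result_q.put(None)

if __name__ == '__main__':
    paths = [p for p in glob.glob('data/**/*', recursive=True) if os.path.isfile(p)]
    keywords = [b'example', b'keyword']     # placeholders for the 648 real keywords
    n_workers = max(cpu_count() - 1, 1)
    work_q, result_q = Queue(maxsize=8), Queue()    # bounded queue caps memory use
    Process(target=reader, args=(paths, work_q, n_workers)).start()
    for _ in range(n_workers):
        Process(target=worker, args=(work_q, result_q, keywords)).start()
    finished = 0
    while finished < n_workers:
        result = result_q.get()
        if result is None:
            finished += 1
        else:
            print('match in %s: %s' % result)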
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.
