I use ffmpeg-python to extract metadata from video files. The official documentation is available here: https://kkroening.github.io/ffmpeg-python/
To extract metadata (duration, resolution, frames per second, etc.) I use the provided function ffmpeg.probe. Sadly, when running it on a large number of video files, it is rather inefficient, as it seems to (obviously?) load the whole file into memory each time to read just a small amount of data.
If that is not what it does, maybe someone could explain what else might cause the rather long runtime.
Otherwise, is there any way to retrieve metadata more efficiently using ffmpeg or some other library?
Any feedback or help is very much appreciated.
Edit: For clarity, I added my code here:
import multiprocessing as mp
import os

import ffmpeg

pool = mp.Pool()
videos = []
for file in os.listdir(directory):
    # Probe each file in a worker process; append each result as it arrives.
    pool.apply_async(ffmpeg.probe, args=[os.path.join(directory, file)], callback=videos.append)
pool.close()
pool.join()
The definition of directory is missing, but this should suffice to show what is going on.
running it on a large number of video files
This is where multithreading/multiprocessing could potentially help, IF the slowdown comes from subprocess spawning (and not from actual I/O). It may not help, as file I/O in general takes time compared to virtually everything else.
load the whole file into memory each time to read just a small amount of data
This assertion is incorrect, IMO. ffprobe should only read the relevant headers/packets to retrieve the metadata. You are likely paying the subprocess tax more than anything else.
a way to retrieve metadata
(1) Adding to what @Peter Hassaballeh said above, ffprobe has options to limit what it looks up. If you only need the container (format)-level info, or only the info for a particular stream, you can specify exactly what you need (to an extent). This could save some time; see the sketch after this list.
(2) You can try MediaInfo (another free tool like ffprobe), which you should be able to call from Python as well.
(3) If you are dealing with one particular file format, the fastest way is to decode it yourself in Python, reading only the bytes that matter to you. Depending on where the current bottleneck is, it may not be that drastic an improvement, though.
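To illustrate (1), here is a minimal sketch that calls ffprobe directly via subprocess and asks only for the container duration plus a few fields of the first video stream (probe_fast is a hypothetical helper name):

import json
import subprocess

def probe_fast(path):
    # Ask ffprobe for the format duration and a few video-stream fields only,
    # suppressing all other output.
    out = subprocess.check_output([
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "format=duration:stream=width,height,avg_frame_rate",
        "-of", "json",
        path,
    ])
    return json.loads(out)

Whether this beats a plain ffmpeg.probe call depends on whether output volume, rather than process startup, is your actual bottleneck.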
I suggest using ffprobe directly. Unfortunately, ffmpeg can be CPU-expensive at times, but it all depends on your hardware specs.
In my program a user uploads a csv file.
While the file is uploading & being processed by my app, I'd like to show a progress bar.
The problem is that this process isn't entirely under my control (I can't really tell how long it'll take for the file to finish loading & be processed, as this depends on the file content and the size).
What would be the correct approach for doing this? It's not like I have many steps where I could increment the progress bar every time a step happens. It's basically waiting for a file to be loaded, and I cannot determine the time for that!
Is this even possible?
Thanks in advance
You don't give much detail, so I'll explain what I think is happening and give some suggestions from my thought process.
You have some kind of app that has some kind of function/process that is a black-box (i.e. you can't see inside it or change it). This black-box uploads a csv file to some server and returns control back to your app when it's done. Since you can't see inside the black-box, you can't determine how much it has uploaded and thus can't create an accurate progress bar.
Named Pipes:
If you're passing only the filename of the csv to the black-box, you might be able to use a named pipe (depending on your situation). Since writes to a named pipe block once the buffer is full, until the receiver reads from it, you can keep track of how much has been read and thus create an accurate progress bar.
So you would create a named pipe, pass its filename to the black-box, and then read from the csv and write to the pipe. How far you've read in is your progress.
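A minimal sketch of that idea, assuming the black-box accepts a filename and reads the file sequentially (FIFO_PATH and the progress callback are illustrative):

import os
import threading

FIFO_PATH = "/tmp/upload.fifo"  # hypothetical path to hand to the black-box

def feed_fifo(csv_path, progress):
    total = os.path.getsize(csv_path)
    written = 0
    with open(csv_path, "rb") as src, open(FIFO_PATH, "wb") as fifo:
        for chunk in iter(lambda: src.read(64 * 1024), b""):
            fifo.write(chunk)  # blocks until the black-box has consumed the data
            written += len(chunk)
            progress(written / total)

os.mkfifo(FIFO_PATH)
t = threading.Thread(target=feed_fifo,
                     args=("data.csv", lambda p: print(f"{p:.0%}")))
t.start()
# ...now hand FIFO_PATH to the black-box as if it were the csv...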
More Pythonic:
Since you tagged Python, if you're passing the csv as a file-like object, this activestate recipe could help.
It's the same kind of idea, just in Python.
Conclusion: these are two possible solutions, and there may be many more, but I can't help further since you haven't given us much to work with.
To answer your question at an abstract level: you can't make accurate progress bars for black-box functions; after all, they could have a sleep(random()) call in them for all you know.
There are implementation-specific ways around this; the two ideas above are examples. The idea is that you can make the black-box take a stream instead and count the bytes as you pass them through, as in the sketch below.
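A sketch of such a byte-counting wrapper, assuming the black-box accepts any file-like object (the class name and callback are illustrative):

import os

class CountingFile:
    """File-like wrapper that reports progress as the wrapped file is read."""
    def __init__(self, f, total, progress):
        self._f = f
        self._total = total
        self._seen = 0
        self._progress = progress

    def read(self, size=-1):
        data = self._f.read(size)
        self._seen += len(data)
        self._progress(self._seen / self._total)
        return data

    def __getattr__(self, name):
        # Delegate everything else (seek, close, ...) to the real file.
        return getattr(self._f, name)

path = "data.csv"
with open(path, "rb") as f:
    wrapped = CountingFile(f, os.path.getsize(path), lambda p: print(f"{p:.0%}"))
    # black_box(wrapped)  # hypothetical call into the upload function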
Alternatively, you can guess/approximate: a rough calculation of how many bytes are going in, together with a (previously measured) average speed per byte, would give you some indication of when it will complete. You could even record how long each run takes and use that to refine the estimate, so it gets better automatically each time.
I'm using feedparser to print the top 5 Google news titles. I get all the information from the URL the same way as always.
import feedparser as fp

x = 'https://news.google.com/news/feeds?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss'
feed = fp.parse(x)
My problem is that I'm running this script when I start a shell, so that ~2 second lag gets quite annoying. Is this delay primarily from communication over the network, or is it from parsing the file?
If it's from parsing the file, is there a way to only take what I need (since that is very minimal in this case)?
If it's from the former possibility, is there any way to speed this process up?
I suppose that a few delays are adding up:
The Python interpreter needs a while to start and import the module
Network communication takes a bit
Parsing probably consumes only a little time, but it does take some
I think there is no straightforward way of speeding things up, especially not the first point. My suggestion is that you download your feeds on a regular basis (you could set up a cron job or write a Python daemon) and store them somewhere on disk (e.g. a plain text file), so at your terminal's startup you only need to display them (echo or cat would probably be the easiest and fastest).
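A minimal sketch of such a fetch script to run from cron (the cache path is illustrative):

#!/usr/bin/env python
# Fetch the feed and cache the top 5 titles for the shell to print at startup.
import feedparser

URL = ('https://news.google.com/news/feeds'
       '?pz=1&cf=all&ned=us&hl=en&topic=t&output=rss')
CACHE = '/tmp/news_titles.txt'  # illustrative cache location

feed = feedparser.parse(URL)
with open(CACHE, 'w') as f:
    for entry in feed.entries[:5]:
        f.write(entry.title + '\n')

Your shell startup then reduces to cat /tmp/news_titles.txt, which is practically instant.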
I personally have had good experiences with feedparser. I use it to download ~100 feeds every half hour with a Python daemon.
Parsing in real time is not the best approach if you want fast results.
You can try doing it asynchronously with Celery or a similar solution. I like Celery; it gives you many capabilities, such as scheduled (cron-like) tasks, asynchronous execution, and more.
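For instance, a minimal Celery sketch, assuming a local Redis broker (module and task names are illustrative):

# tasks.py
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def fetch_feed(url):
    import feedparser
    return [entry.title for entry in feedparser.parse(url).entries[:5]]

# Caller: fetch_feed.delay(url) returns immediately; a worker started with
# `celery -A tasks worker` does the actual downloading and parsing.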
I was given a task to process a large collection of audio files. Each file must be processed in four steps:
conversion from .wav to raw PCM,
resampling,
quantization,
coding with one of three speech codecs.
Each step corresponds to a program that takes a file as input and returns a file as output. Processing file by file seems to take a long time. How can I optimize the procedure? E.g. parallel programming or something? I tried to make use of a ramdisk to reduce the time spent reading/writing files, but it didn't give an improvement. (Why?)
I'm writing in Python under Ubuntu Linux. Thanks in advance.
Reading and writing to disk is pretty slow. If each program's result is being written to disk, then it would be better to stop that from happening. Sockets seem like a good fit to me. Read more here: http://docs.python.org/library/ipc.html
Parallel programming is nice, but I'd need more info before I can say much more on this topic. I remember reading a while ago that Python doesn't handle threading very efficiently: as far as I recall, it just emulates parallel processing by switching between tasks really quickly, so threads won't help here. (This may have changed since I last worked with threading.) Extra processes, on the other hand, sound like a good idea; see the sketch below.
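A sketch combining both ideas, assuming each step is a command-line tool that reads stdin and writes stdout (the tool names wav2pcm, resample, quantize, and codec are placeholders for your actual programs):

import glob
import multiprocessing as mp
import subprocess

def process_one(wav_path):
    out_path = wav_path + '.coded'
    with open(wav_path, 'rb') as src, open(out_path, 'wb') as dst:
        # Chain the four steps with OS pipes so intermediate results never hit disk.
        p1 = subprocess.Popen(['wav2pcm'], stdin=src, stdout=subprocess.PIPE)
        p2 = subprocess.Popen(['resample'], stdin=p1.stdout, stdout=subprocess.PIPE)
        p3 = subprocess.Popen(['quantize'], stdin=p2.stdout, stdout=subprocess.PIPE)
        p4 = subprocess.Popen(['codec'], stdin=p3.stdout, stdout=dst)
        for p in (p1, p2, p3):
            p.stdout.close()  # let upstream tools receive SIGPIPE if one dies
        p4.wait()
    return out_path

if __name__ == '__main__':
    # One worker per CPU core; each worker runs a whole file's pipeline.
    with mp.Pool() as pool:
        pool.map(process_one, glob.glob('audio/*.wav'))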
If you need a less-vague answer please supply specifics in your question.
EDIT
The thing I read a while ago about threads is the Global Interpreter Lock: http://docs.python.org/2/glossary.html#term-global-interpreter-lock
I currently have a Python app in development which data-carves a block device for jpeg files. Let's just say that it sometimes works and sometimes doesn't. I have built it so that I read the block device until I find an ffd8, then I keep the stream open and loop in search of the ffd9 closure. I always need to take into account every ffd9 closure, not just the first one, so it tends to be a really intensive operation. Given a device with, say, 25 jpegs as well as lots of other data, the looping is pretty dramatic and it runs through a lot.
The program is not the slowest thing in the world, but I think it could be much faster and much more efficient. I am looking for a better way to search the block device and extract the data in a more efficient manner. I also don't want to kill the HDD or the drive holding the image of the block device.
So does anybody know of a better way to systematically handle the searching and extraction of the data?
The trouble with reading the block device directly is that there is no guarantee that the blocks of any given file are contiguous. That means that even if you find your magic marker bytes 0xFFD8 in block 13, say, there is no guarantee that block 14 belongs to the same file, whether or not it contains the 0xFFD9 end marker. (Most files will start on a block boundary; the end of the file may be anywhere, possibly even across block boundaries.)
What's the better way to deal with it? Well, it depends what you're after. If you're looking only at currently allocated blocks, then scan the file system using the Python analogue of the POSIX C function ftw (nftw), namely os.walk, and read each file in turn. This won't find evidence of deleted JPEG files in the free list; if that's what you are after, then you'll need to do more or less what you are doing now, but correlate that information with what you find in the file system proper. Mapping those blocks will (at best) be hard.
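If you do keep scanning raw, the main speed win is to read in large chunks and let bytes.find (which runs in C) do the searching, rather than a byte-by-byte Python loop. A sketch, assuming you can open the device or its image as a plain file (the path and chunk size are illustrative):

def find_markers(image_path, marker, chunk_size=64 * 1024 * 1024):
    """Yield absolute byte offsets of `marker`, reading the file in chunks."""
    overlap = len(marker) - 1
    pos = 0    # absolute offset of the start of `buf`
    tail = b''
    with open(image_path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            i = buf.find(marker)
            while i != -1:
                yield pos + i
                i = buf.find(marker, i + 1)
            pos += len(buf) - overlap
            tail = buf[-overlap:]  # keep overlap so split markers are still found

# Candidate start/end offsets to pair up when carving:
starts = list(find_markers('disk.img', b'\xff\xd8\xff'))
ends = list(find_markers('disk.img', b'\xff\xd9'))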
I am looking for some general advice about the mp3 format before I start a small project to make sure I am not on a wild-goose chase.
My understanding of the internals of the mp3 format is minimal. Ideally, I am looking for a library that would abstract those details away. I would prefer to use Python (but could be convinced otherwise).
I would like to modify a set of mp3 files in a fairly simple way. I am not so much interested in the ID3 tags but in the audio itself. I want to be able to delete sections (e.g. drop 10 seconds from the 3rd minute) and insert sections (e.g. add credits to the end).
My understanding is that the mp3 format is lossy, and so decoding it to (for example) PCM format, making the modifications, and then encoding it again to MP3 will lower the audio quality. (I would love to hear that I am wrong.)
I conjecture that if I stay in mp3 format, there will be some sort of minimum frame or packet size to deal with, so the granularity of the operations may be coarser. I can live with that, as long as I get accuracy within a couple of seconds.
I have looked at PyMedia, but it requires me to migrate to PCM to process the data. Similarly, LAME wants to help me encode, but not access the data in place. I have seen several other libraries that only deal with the ID3 tags.
Can anyone recommend a Python MP3 library? Alternatively, can you disabuse me of my assumption that going to PCM and back is bad and avoidable?
If you want to do things low-level, use pymad. It turns MP3s into a buffer of sample data.
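A minimal pymad decode loop, to give a feel for the low-level route (the filename is illustrative; as I understand the API, read() returns raw PCM chunks, or None at end of file):

import mad

mf = mad.MadFile("song.mp3")  # illustrative filename
print(mf.samplerate(), "Hz,", mf.total_time(), "ms")
while True:
    buf = mf.read()           # one decoded chunk of 16-bit PCM samples
    if buf is None:
        break
    # ...process the raw samples in buf...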
If you want something a little higher-level, use the Echo Nest Remix API (disclosure: I wrote part of it for my dayjob). It includes a few examples. If you look at the cowbell example (i.e., MoreCowbell.dj), you'll see a fork of pymad that gives you a NumPy array instead of a buffer. That datatype makes it easier to slice out sections and do math on them.
I got three quality answers, and I thank you all for them. I haven't chosen any as the accepted answer, because each addressed one aspect, so I wanted to write a summary.
Do you need to work in MP3?
Transcoding to PCM and back to MP3 is unlikely to result in a drop in quality.
Don't optimise audio-quality prematurely; test it with a simple prototype and listen to it.
Working in MP3
Wikipedia has a summary of the MP3 File Format.
MP3 frames are short (1152 samples, about 26 ms at 44.1 kHz), allowing for moderate precision at that level.
However, Wikipedia warns that "Frames are not independent items ("byte reservoir") and therefore cannot be extracted on arbitrary frame boundaries."
Existing libraries are unlikely to be of assistance, if I really want to avoid decoding.
Working in PCM
There are several libraries at this level:
LAME (latest release: October 2017)
PyMedia (latest release: February 2006)
PyMad (Linux only? Decoder only? Latest release: January 2007)
Working at a higher level
Echo Nest Remix API (Mac or Linux only, at the moment) is an API to a web-service that supports quite sophisticated operations (e.g. finding the locations of music beats and tempo, etc.)
mp3DirectCut (Windows only) is a GUI that apparently performs the operations I want, but only as an app. It is not open-source. (I tried to run it, got an Access Denied installer error, and didn't follow up. A GUI isn't suitable for me anyway, as I want to run these operations repeatedly on a changing library of files.)
My plan is now to start out in PyMedia, using PCM.
MP3 is lossy, but it is lossy in a very specific way. The algorithms used are designed to discard certain parts of the audio which your ears are unable to hear (or can hear only with great difficulty). Re-doing the compression process at the same level of compression over and over is likely to yield nearly identical results for a given piece of audio. However, some additional losses may slowly accumulate. If you're going to be modifying files a lot, this might be a bad idea. It would also be a bad idea if you were concerned about quality; but then, using MP3 at all is a bad idea if you are concerned about quality.
You could construct a test using an encoder and a decoder to re-encode a few different mp3 files a few times and watch how they change; this could help you determine the rate of deterioration and figure out whether it is acceptable to you. It sounds like you already have libraries you could use to run this simple test.
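A sketch of such a generation-loss test, driving the lame CLI via subprocess (assumes lame is on your PATH; filenames and the bitrate are illustrative):

import shutil
import subprocess

shutil.copy("original.mp3", "gen.mp3")
for generation in range(10):
    # Decode to WAV, then re-encode at the same bitrate.
    subprocess.check_call(["lame", "--decode", "gen.mp3", "gen.wav"])
    subprocess.check_call(["lame", "-b", "128", "gen.wav", "gen.mp3"])
# Now listen to gen.mp3 (the 10th generation) against original.mp3.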
MP3 files are composed of "frames" of audio, and so it should be possible, with some effort, to remove entire frames with minimal processing (remove the frame, update some minor details in the file header). I believe frames are pretty short (a few tens of milliseconds each), which would give the precision you're looking for. So doing some reading on the MP3 File Format should give you enough information to code your own Python library to do this. This is a fair bit different from traditional "audio processing" (since you don't need fine-grained precision), and so you're unlikely to find an existing library that does it. Most, as you've found, will decompress the audio first so you can have complete fine-grained control.
Not a direct answer to your needs, but check out the mp3DirectCut software, which does what you want (as a GUI app). I think the source code is available, so even if you don't find a library, you could build one of your own, or build a Python extension using code from mp3DirectCut.
As for removing or extracting mp3 segments from an mp3 file while staying in the MP3 domain (that is, without conversion to PCM format and back), there is also the open source package PyMp3Cut.
As for splicing MP3 files together (adding e.g. credits to the end or beginning of an mp3 file), I've found you can simply concatenate the MP3 files, provided the files have the same sampling rate (e.g. 44.1 kHz) and the same number of channels (e.g. both stereo or both mono).
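A sketch of that naive concatenation (filenames are illustrative; note that any ID3 tag at the start of the second file will end up mid-stream, though most decoders skip such junk):

import shutil

with open("combined.mp3", "wb") as out:
    for part in ["song.mp3", "credits.mp3"]:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out)  # raw byte-level append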