I'm new to working with large amounts of data. I have a pretty big data set (around 1 million audio files, each a couple of seconds long), and I'm trying to load the data efficiently for visualization purposes (and eventually to use as training data in a neural network).
What I've tried so far is librosa (using librosa.load(filename)), but it took a couple of hours just to load 10,000 of the files. I tried to find out whether I could use a GPU to speed things up (I fumbled around with Numba), but I'm not sure this is even a valid problem for a GPU to solve.
I feel like I'm missing something really obvious. Can someone more experienced tell me what to do? I am having a hard time trying to find the solution on the Internet. Thanks for the help!
You could use pygame.
In this mini program I made, I tested how long it takes to load a sound file that is about 10 seconds long:
import pygame
import time
pygame.init()
time_now = time.time()
# mixer.music.load() returns None; it just loads the file for streaming playback
pygame.mixer.music.load('music.wav')
print(time.time() - time_now)
And this is the result:
0.0
And if you want to play that file, you do:
pygame.mixer.music.play(loops=int, start=float)
Even so, it will still take roughly 1-4 hours to load all of them.
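If you want to estimate the total cost on your own data, here is a minimal sketch that times loading a whole folder of clips with pygame.mixer.Sound (which, unlike mixer.music, fully decodes each file into memory). The directory pattern is a placeholder:
import glob
import time
import pygame
pygame.mixer.init()
# placeholder pattern; point it at your own clips
paths = glob.glob("audio_clips/*.wav")
start = time.time()
# each pygame.mixer.Sound is fully decoded into memory
sounds = [pygame.mixer.Sound(p) for p in paths]
print("loaded %d files in %.1fs" % (len(sounds), time.time() - start))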
For further info, go to https://www.pygame.org/docs/ref/music.html .
I am using a video with around 30,000 frames and trying to use the FER code below for emotion recognition.
The entire process takes anywhere between 10 and 15 hours just to analyze the video.
Is there a way to speed up the processing, or another algorithm to detect facial emotion?
Here is the code:
from fer import Video
from fer import FER
location_videofile = "/Users/Akash/Desktop/videoplayback.mp4"
# the detector that analyze() expects as its first argument
face_detector = FER()
input_video = Video(location_videofile)
processing_data = input_video.analyze(face_detector, display=False, frequency=5)
I tried adding the frequency parameter to the analyze function as well, but it was of no use: the processing time is pretty much the same. I am assuming it affects the output rather than the analysis itself.
Here are several solutions that may or may not work with your particular video.
The FER code relies on tensorflow and opencv for processing the data.
Assuming a default installation of these packages through pip, tensorflow already runs on the GPU (you may want to double-check that), while opencv does not.
Some of the functionalities of opencv can run on gpu and they may be the ones that FER is using: in this case, you may want to build the opencv package with GPU support (you can take a look here).
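For the tensorflow side, a quick check (tensorflow 2.x API) tells you whether a GPU is actually visible:
import tensorflow as tf
# an empty list means tensorflow cannot see a GPU
print(tf.config.list_physical_devices("GPU"))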
Another solution is to downsample the frames of your video yourself before supplying it to FER.
Downsample each frame of the video in order to reduce the number of pixels in each frame. This may give a huge speed-up if you can afford it (i.e. faces occupy much of the screen and the frame resolution is relatively high); see the sketch after this list.
Multiprocessing. You could split the video into several mini-videos that you analyze with multiple Python processes. In my opinion, this is the cheapest and most reliable way to deal with the speed issue without loss in accuracy.
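As a sketch of the downsampling idea: the snippet below writes a half-resolution copy of the video with opencv, which you can then hand to FER. The 0.5 scale factor and output filename are arbitrary choices to tune against detection accuracy:
import cv2
cap = cv2.VideoCapture("/Users/Akash/Desktop/videoplayback.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * 0.5)
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * 0.5)
out = cv2.VideoWriter("videoplayback_small.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # shrink every frame to half resolution before re-encoding
    out.write(cv2.resize(frame, (w, h)))
cap.release()
out.release()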
I have a number of .mp3 files which all start with a short voice introduction followed by piano music. I would like to remove the voice part and just be left with the piano part, preferably using a Python script. The voice part is of variable length, i.e. I cannot use ffmpeg to remove a fixed number of seconds from the start of each file.
Is there a way of detecting the start of the piano part, and then knowing how many seconds to remove, using ffmpeg or even Python itself?
Thank you
This is a non-trivial problem if you want a good outcome.
Quick and dirty solutions would involve inferred parameters like:
"there's usually 15 seconds of no or low-db audio between the speaker and the piano"
"there's usually not 15 seconds of no or low-db audio in the middle of the piano piece"
and then use those parameters to try to get something "good enough" using audio analysis libraries.
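For example, a minimal sketch of that quick-and-dirty idea using pydub's silence detection; the filename, threshold, and minimum gap length are all guesses you would have to tune:
from pydub import AudioSegment
from pydub.silence import detect_silence
audio = AudioSegment.from_mp3("piece.mp3")  # placeholder filename
# look for quiet gaps at least 5 s long and below -40 dBFS (both guesses)
silences = detect_silence(audio, min_silence_len=5000, silence_thresh=-40)
if silences:
    # assume the first long gap separates the voice from the piano
    _, gap_end = silences[0]
    audio[gap_end:].export("piano_only.mp3", format="mp3")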
I suspect you'll be disappointed with that approach given that I can think of many piano pieces with long pauses and this reads like a classic ML problem.
The best solution here is to use ML with a classification model and a large data set. Here's a walk-through that might help you get started. However, this isn't going to be a few minutes of coding. This is a typical ML task that will involve collecting and tagging lots of data (or having access to pre-tagged data), building an ML pipeline, training a neural net, and so forth.
Here's another link that may be helpful. He's using a pretrained model to reduce the amount of data required to get started, but you're still going to put in quite a bit of work to get this going.
I have .wav files sampled at 192kHz and want to split them based on time to many smaller files while keeping the same sample rate.
To start with, I thought I would just open and re-save a wav file using pydub in order to learn how to do this. However, when I save it, the file appears to be re-saved at a much smaller file size. I'm not sure why; perhaps the sample rate is lower? I also can't open the new file with the audio analysis program I usually use (Song Scope).
So I had two questions:
- How do I open, read, copy and re-save a wav file using pydub without changing it? (Sorry, I know this is probably easy; I just can't find it yet.)
- Are Python and pydub a sensible choice for what I am trying to do, or is there a much simpler way?
What I am exactly trying to do is: split about 10 high-sample-rate wav files (~1 GB each) into many (about 100) small wav files. I plan to make a list of start and end times for each of the smaller wav files needed, then get Python to open, copy and re-save the wav file data between those times; see the sketch further down.
I assume it is possible since I've seen questions for lower frequency wav files, but if you know otherwise or know of a simpler way please let me know. Thanks!!
My code so far is as follows:
from pydub import AudioSegment
# Input audio file to be sliced
audio = AudioSegment.from_wav("20190212_164446.wav")
audio.export("newWavFile.wav")
(I put the wav file and ffmpeg in the same directory as the Python file to save time, since I was having a lot of trouble getting pydub to find ffmpeg.)
In case it's relevant, the files are of bat calls; these bats call between around 1 kHz and 50 kHz, which is quite low frequency for bats. I'm trying to crop the actual calls out of some very long files.
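For the slicing step, this is roughly what I have in mind (untested; the times and output names are placeholders):
from pydub import AudioSegment
audio = AudioSegment.from_wav("20190212_164446.wav")
# placeholder list of (start_ms, end_ms) pairs, one per call
clips = [(0, 5000), (60000, 65000)]
for i, (start_ms, end_ms) in enumerate(clips):
    segment = audio[start_ms:end_ms]  # pydub slices in milliseconds
    segment.export("call_%d.wav" % i, format="wav")  # explicit wav format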
I know this is a basic question; I just couldn't find the answer yet. Please also feel free to direct me to the answer if it's a duplicate.
thanks!!
I'm doing some work that requires me to use OpenCV or pytesseract to parse text from a screen. I've tried implementing both, and I'm just completely unsure which is faster and puts less strain on my PC. Does anyone have any tips?
Bonus question: another which-is-faster question. Should I take a single screenshot of the screen and parse the data from it after cropping to the relevant sections, or should I take multiple screenshots of the screen with those dimensions right away and then parse the data? Again, I've tried both methods and I can't tell which is better/faster.
Thanks!!!!
Did you try timing them to see which one is faster? For example:
import time
start_time = time.time()
main() #your function using opencv/pytesseract or multi screenshot/cropped
print(time.time()-start_time)
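For the bonus question the same trick applies. Here is a rough sketch assuming Pillow's ImageGrab and pytesseract, with made-up region coordinates:
import time
import pytesseract
from PIL import ImageGrab
regions = [(0, 0, 400, 100), (0, 200, 400, 300)]  # placeholder (left, top, right, bottom) boxes
# strategy 1: one full screenshot, then crop each region from it
start = time.time()
full = ImageGrab.grab()
texts = [pytesseract.image_to_string(full.crop(r)) for r in regions]
print("single grab:", time.time() - start)
# strategy 2: one screenshot per region
start = time.time()
texts = [pytesseract.image_to_string(ImageGrab.grab(bbox=r)) for r in regions]
print("multi grab:", time.time() - start)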
I have a number of MP3 files containing lectures where the speaker talks very slowly and I would like to alter the MP3 file so that the playback rate is about 1.5 times as fast as normal.
Can someone suggest a good Python library for this? By the way, I'm running Python 2.6 on Windows.
Thanks in advance.
I wrote a library, pydub, which is mainly designed for manipulating audio.
I've created an experimental time-stretching algorithm if you're interested in seeing how these sorts of things work.
Essentially you want to throw away a portion of your data, but you can't just play back the waveform faster, because then it all gets high-pitched (as synthesizerpatel mentioned). Instead you want to throw away chunks: 20 Hz is the lowest frequency a human can hear, so dropping 50 ms chunks does not cause audible frequency changes (though there are other artifacts).
PS - I get 50 ms like so:
20 Hz == 1 second per 20 cycles
or
1000 ms per 20 cycles
or
1000 ms / 20 cycles == 50 ms per cycle
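To illustrate, here is a crude sketch of that chunk-dropping idea using pydub (not the algorithm from the repo linked above; the input filename is a placeholder). Dropping every third 50 ms chunk shortens the audio to 2/3 of its length, i.e. roughly 1.5x speed at the original pitch, at the cost of some audible artifacts:
from pydub import AudioSegment
def crude_speedup(sound, keep=2, drop=1, chunk_ms=50):
    # keep `keep` chunks, then drop `drop`, repeating; pydub slices in milliseconds
    out = AudioSegment.empty()
    cycle = keep + drop
    for n, start in enumerate(range(0, len(sound), chunk_ms)):
        if n % cycle < keep:
            out += sound[start:start + chunk_ms]
    return out
faster = crude_speedup(AudioSegment.from_mp3("lecture.mp3"))  # placeholder filename
faster.export("lecture_fast.mp3", format="mp3")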
pymedia includes a recode_audio.py example that allows arbitrary input and output formats, available here. This of course requires installing pymedia as well.
Note that, as Nick T points out, if you just change the sample rate without resampling you'll get high-pitched 'fast' audio, so you'll want to employ time-stretching in combination with the sample-rate change.
You can try the _spawn method in pydub's audio_segment.py. Here is some example code:
from pydub import AudioSegment
import os
def speed_change(sound, speed=1.0):
    # relabel the raw data at a scaled frame rate, then resample back to the
    # original rate so the file plays everywhere (pitch shifts along with speed)
    altered = sound._spawn(sound.raw_data, overrides={"frame_rate": int(sound.frame_rate * speed)})
    return altered.set_frame_rate(sound.frame_rate)
in_path = 'your/path/of/input_file/hello.mp3'
ex_dir = 'your/path/of/output_file'
sound = AudioSegment.from_file(in_path)
# generate a slower audio for example
slower_sound = speed_change(sound, 0.5)
slower_sound.export(os.path.join(ex_dir, 'slower.mp3'), format="mp3")