I made a Telegram bot, and one of its jobs is to create samples from audio files. For most audio files that are sent to it, the sample is perfectly fine; something like this:
However, for some audio files, the sample looks a bit odd:
As you can see, the waveform in this file is not shown! (I can assure you that the voice message is not empty.)
For creating the sample, I use pydub (Thanks, James!). Here's the part where I create the sample:
from pydub import AudioSegment

song = AudioSegment.from_mp3('song.mp3')
sliced = song[start*1000:end*1000]  # start/end are in seconds, pydub slices in milliseconds
sliced.export('song.ogg', format='ogg', parameters=["-acodec", "libopus"])
And then I send the sample using bot.send_voice method. Like this:
bot.send_voice(
    chat_id=update.message.chat.id,
    voice=open('song.ogg', 'rb'),
    caption=settings.caption,
    parse_mode=ParseMode.MARKDOWN,
    timeout=1000
)
The documentation of Telegram Bot API says:
Use this method to send audio files, if you want Telegram clients to
display the file as a playable voice message. For this to work, your
audio must be in an .ogg file encoded with OPUS (other formats may be
sent as Audio or Document).
That's why in this line of code:
sliced.export('song.ogg', format='ogg', parameters=["-acodec", "libopus"])
I used parameters=["-acodec", "libopus"].
Can anyone tell me what I'm doing wrong? Thanks in advance!
Shot in the dark guess:
Having just sampled those two Muse songs, "Pressure" is a much louder rock song than "The Void". I suspect the Telegram service itself just detects the music as noise when performing speech-to-text translation. Unlike speech, which has a wide dynamic range between spoken words, music tends to be all the same volume. So the relative volume of each sample is roughly the same, and hence the flat line.
Since it happens only to some of the songs, I believe the issue is linked to the original song format. Make sure that pydub got the file parameters right, e.g. number of channels, sample width, frame rate, etc. Sometimes the resulting format also changes, so you can get audio in the range [-1..1] (float), and sometimes [-32768..32767] (integer).
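As a quick check, pydub exposes those parameters as properties on the AudioSegment. Forcing everything to one common format before slicing, as below, is only a guess at a fix, not something confirmed here:

from pydub import AudioSegment

song = AudioSegment.from_mp3('song.mp3')

# What did pydub actually decode?
print(song.channels, song.sample_width, song.frame_rate, song.max_dBFS)

# Coerce to a common format (16-bit stereo at 48 kHz) before slicing/exporting
song = song.set_sample_width(2).set_channels(2).set_frame_rate(48000)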
I have an audio file and a text that corresponds to the speech in this audio file.
Is there any way to match the text to the audio so that I get something like timestamps showing where the words in the text file appear in the audio?
So I have found exactly what I was looking for.
Apparently, the technology that matches a given text to an audio file and returns exact timestamps is called forced alignment.
Here is an extremely useful link to a list of the best forced alignment tools: https://github.com/pettarin/forced-alignment-tools
Personally, I have used Aeneas as it worked really well for me.
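For reference, here is a minimal sketch of driving Aeneas from Python; the paths, language, and output format are placeholders, so check the aeneas documentation for the exact configuration you need:

from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# One alignment task: English audio, plain-text transcript, JSON sync map output
task = Task(config_string="task_language=eng|is_text_type=plain|os_task_file_format=json")
task.audio_file_path_absolute = "/path/to/audio.mp3"
task.text_file_path_absolute = "/path/to/transcript.txt"
task.sync_map_file_path_absolute = "/path/to/syncmap.json"

# Run the forced alignment and write the fragment timestamps to the JSON file
ExecuteTask(task).execute()
task.output_sync_map_file()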
Yes, that is possible. I am assuming you are aware of basic terminology around the audio tech.
Check the library described here: https://www.geeksforgeeks.org/python-speech-recognition-on-large-audio-files/
The library can read an audio file chunk by chunk. You can pass each chunk for audio-to-text conversion and collect the resulting text chunk by chunk.
Also, if the sample rate of the audio file is 44100 Hz, then a chunk of 8192 samples represents roughly 185 milliseconds.
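A rough sketch of that chunked approach, assuming the SpeechRecognition and pydub packages used in the linked article; the file name and the fixed 10-second chunk length are placeholders (the article itself splits on silence instead):

import speech_recognition as sr
from pydub import AudioSegment

song = AudioSegment.from_wav('speech.wav')
recognizer = sr.Recognizer()
chunk_ms = 10_000  # fixed 10-second chunks, for simplicity

for start in range(0, len(song), chunk_ms):
    song[start:start + chunk_ms].export('chunk.wav', format='wav')
    with sr.AudioFile('chunk.wav') as source:
        audio = recognizer.record(source)
    try:
        text = recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        text = ''
    # The chunk offset doubles as a coarse timestamp for the recognized text
    print(start / 1000.0, text)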
My goal is to be able to detect a specific noise that comes through the speakers of a PC using Python. That means the following, in pseudo code:
Sound is being played out of the speakers by applications such as games, for example.
My "audio to detect" sound happens, and I want to detect that and take an action.
The specific sound I want to detect can be found here.
If I break that down, I believe I need two things:
A way to sample the audio that is being streamed to an audio device
I actually have this bit working -- with the code found here: https://gist.github.com/renegadeandy/8424327f471f52a1b656bfb1c4ddf3e8 -- it is based on the sounddevice example plot, which I combine with an audio loopback device. This allows my code to receive a callback with data that is played to the speakers (a stripped-down sketch of this setup is shown below).
A way to compare each sample with my "audio to detect" sound file.
The detection does not need to be exact - it just needs to be close. For example, there will be lots of other noises happening at the same time, so it's more about being able to detect the footprint of the "audio to detect" within the audio stream of a variety of sounds.
Having investigated this, I found technologies mentioned in this post on SO and also this interesting article on Chromaprint. The Chromaprint article uses fpcalc to generate fingerprints, but because my "audio to detect" is around 1 - 2 seconds, fpcalc can't generate the fingerprint. I need something which works across smaller timespans.
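For reference, a stripped-down version of the capture step from #1 (the full code is in the gist linked above); the sample rate, block size, and channel count here are placeholders for whatever the loopback device provides:

import numpy as np
import sounddevice as sd

def audio_callback(indata, frames, time, status):
    # indata is a (frames, channels) float32 array of what is being played;
    # each block would be handed to the matching step from #2 here.
    print(f'{frames} frames, mean level {float(np.abs(indata).mean()):.4f}')

with sd.InputStream(channels=2, samplerate=44100, blocksize=4096,
                    callback=audio_callback):
    sd.sleep(5000)  # capture for five seconds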
Can somebody help me with problem #2 as detailed above?
How should I attempt this comparison (ideally with a little example), based upon my sampling using sounddevice in the audio_callback function?
Many thanks in advance.
I'm currently working on a little Python script to equalize MP3 files.
I've read some docs about the MP3 file format (at https://en.wikipedia.org/wiki/ID3)
and I've noticed that the ID3v2 format has fields for equalization (EQUA, EQU2).
Using the Python library mutagen, I've tried to extract this information from the MP3, but the field isn't present.
What's the right way to equalize an MP3 file regardless of the ID3 version?
Thanks in advance. Creekorful
There are two high-level approaches you can take: modify the encoded audio stream, or put metadata on it describing the desired change. Modifying the audio stream is the most compatible, but generally less desirable. However, ID3v1 has no place for this metadata, only ID3v2.2 and up do.
Depending on what you mean by equalize, you might want equalization information stored in the EQA/EQUA/EQU2 frames, or a replay gain volume adjustment stored in the RVA/RVAD/RVA2 frames. Mutagen supports these frames, with the exception of EQA/EQUA. If you need those, it should be straightforward to add them from the information in the actual specification (see 4.12 on http://id3.org/id3v2.4.0-frames). With tests they could likely be contributed back to the project.
Note that Quod Libet, the player paired with Mutagen, has taken a preference for reading and storing replay gain information in a TXXX frame.
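As an illustration of the metadata route, a small sketch using mutagen's RVA2 frame plus the replay gain TXXX convention mentioned above; the -6 dB figure is just an example value, and the file is assumed to already carry an ID3 tag:

from mutagen.id3 import ID3, RVA2, TXXX

tags = ID3('song.mp3')

# ID3v2.4 relative volume adjustment: -6 dB on the master channel
tags.add(RVA2(desc='track', channel=1, gain=-6.0, peak=0.0))

# The same adjustment expressed as a replay gain TXXX frame (the Quod Libet convention)
tags.add(TXXX(encoding=3, desc='replaygain_track_gain', text=['-6.00 dB']))

tags.save()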
I used the Python wave module and read the first frame from a .wav file, and it returned this:
b'\x00\x00\x00\x00\x00\x00'
What does each byte mean and will it be the same for every frame or for just some?
I've done some research into the subject and have found that there are bytes in front of the sound data that give information about the .wav file, so does Python skip this information and go straight to the sound data, or do I have to separate it out manually?
There are 2 channels and a sample width of 3 according to python.
UPDATE
I have successfully created the waveform for the wav file, it wasn't as difficult as I first thought, now to show it whilst the song is playing....
The wave module reads the header for you, which is why it can tell you how many channels there are, and what the sample width is.
Reading frames gives you direct access to the raw sample data, but because the WAV format is a bit of a mixed, confused beast, how you need to interpret each frame depends on the sample width and channel count. See this article for a good in-depth discussion of that.
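To make that concrete, here is a small sketch of decoding one frame by hand for the 2-channel, 3-bytes-per-sample case from the question (so the 6-byte frame b'\x00\x00\x00\x00\x00\x00' decodes to one silent sample per channel):

import wave

with wave.open('song.wav', 'rb') as w:
    n_channels = w.getnchannels()   # 2 in the question
    samp_width = w.getsampwidth()   # 3 bytes, i.e. 24-bit samples
    frame = w.readframes(1)         # one frame = samp_width * n_channels bytes

# One signed little-endian integer per channel; struct has no 24-bit format,
# so decode by hand with int.from_bytes
samples = [
    int.from_bytes(frame[ch * samp_width:(ch + 1) * samp_width], 'little', signed=True)
    for ch in range(n_channels)
]
print(samples)   # [0, 0] for the all-zero frame above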
So, I'm planning to try making a light organ with an Arduino and Python, communicating over serial to control the brightness of several LEDs. The computer will use the microphone or a playing MP3 to generate the data.
I'm not so sure how to handle the audio processing. What's a good option for python that can take either a playing audio file or microphone data (I'd prefer the microphone), and then split it into different frequency ranges and write the intensity to variables? Do I need to worry about overtones if I use the microphone?
If you're not committed to using Python, you should also look at using PureData (PD) to handle the audio analysis. Interfacing PD to the Arduino is already a solved problem, and there are a lot of pre-existing components that make working with audio easy.
Try http://wiki.python.org/moin/Audio for links to various Python audio processing packages.
The audioop package has some basic waveform manipulation functions.
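For example, audioop can give a quick loudness reading for a block of raw PCM bytes (note that audioop is deprecated in recent Python 3 releases and removed in 3.13):

import audioop
import wave

with wave.open('clip.wav', 'rb') as w:
    raw = w.readframes(w.getnframes())
    width = w.getsampwidth()

# Root-mean-square amplitude of the whole clip; 0 means silence
print(audioop.rms(raw, width))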
See also:
Detect and record a sound with python
Detect & Record Audio in Python
Portaudio has a Python interface that would let you read data off the microphone.
For the band splitting, you could use something like a band-pass filter feeding into an envelope follower -- one filter+follower for each frequency band of interest.
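To illustrate the filter-plus-follower idea, here is a rough sketch using scipy (not mentioned above); it assumes you already have a block of mono samples from the microphone and uses a rectified average as a crude envelope:

import numpy as np
from scipy.signal import butter, sosfilt

def band_intensity(block, rate, low_hz, high_hz):
    # Band-pass one frequency band of a block of samples, then follow its
    # envelope, returning a single intensity number to map onto an LED.
    # 4th-order Butterworth band-pass (for real use, build this once, not per block)
    sos = butter(4, [low_hz, high_hz], btype='bandpass', fs=rate, output='sos')
    band = sosfilt(sos, block)
    return float(np.mean(np.abs(band)))   # rectify + average = crude envelope

# Hypothetical usage with a block of samples captured at 44.1 kHz:
# bass   = band_intensity(block, 44100, 60, 250)
# mids   = band_intensity(block, 44100, 250, 2000)
# treble = band_intensity(block, 44100, 2000, 8000)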