I am confused by this code. Basically, I am trying to use MIDI-DDSP to synthesize a MIDI file, but I couldn't find how to convert the output of the following code to audio. The link for this model is its 'Minimal example'.
Here is the simple code they provide; I didn't add anything, just followed their example.
from midi_ddsp import synthesize_midi, load_pretrained_model
midi_file = 'ode_to_joy.mid'
# Load pre-trained model
synthesis_generator, expression_generator = load_pretrained_model()
# Synthesize MIDI
output = synthesize_midi(synthesis_generator, expression_generator, midi_file)
# The synthesized audio
synthesized_audio = output['mix_audio']
synthesized_audio
synthesized_audio is an array, not a WAV or MIDI file. I hope someone can tell me how to convert it to audio or MIDI. Thanks a lot.
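One way to get audio out, sketched below: output['mix_audio'] is a raw waveform array, so you can write it to a WAV file yourself. The 16 kHz sample rate here is an assumption about MIDI-DDSP's default; verify it against the library's constants before relying on it.
import numpy as np
import soundfile as sf
# Assumption: MIDI-DDSP synthesizes at 16 kHz; check the library to confirm.
SAMPLE_RATE = 16000
# Drop any batch/channel dimension and write a mono WAV file.
sf.write('ode_to_joy.wav', np.asarray(synthesized_audio).squeeze(), SAMPLE_RATE)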
I am trying to write a model for audio classification using the torchaudio.datasets.SPEECHCOMMANDS dataset. I have trained the model, but for prediction and evaluation on custom sounds, I am having issues converting them to the same format my model uses.
Following is the torchaudio.info command and its result for one of my dataset audio files:
metadata = torchaudio.info("./data/SpeechCommands/speech_commands_v0.02/yes/00f0204f_nohash_0.wav")
print("Origional :", metadata)
Result:
Origional : AudioMetaData(sample_rate=16000, num_frames=16000, num_channels=1, bits_per_sample=16, encoding=PCM_S)
I am using pydub to apply transformations to my data. Following is the code I have used:
from pydub import AudioSegment
sound = AudioSegment.from_wav("/content/yes.wav") # yes.wav is my recorded file
sound = sound.set_channels(1)
sound = sound.set_frame_rate(16000)
sound.export("/content/yes_mono.wav", format="wav")
Even after these transformations, my custom audio file's format differs from the one in the dataset.
metadata = torchaudio.info("/content/yes_mono.wav")
print("Custom Mono:", metadata)
#Result: Custom Mono: AudioMetaData(sample_rate=16000, num_frames=36864, num_channels=1, bits_per_sample=16, encoding=PCM_S)
Here num_frames differs, and I have been unable to find a way to fix this. Please suggest a solution or point out the flaw in my understanding.
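For what it's worth, the two files appear to differ in duration, not encoding: 36864 frames at 16 kHz is about 2.3 seconds, while every SpeechCommands clip is exactly one second (16000 frames). A minimal sketch that trims or zero-pads the custom recording to one second (the path is taken from the question):
import torch
import torchaudio
waveform, sample_rate = torchaudio.load("/content/yes_mono.wav")
target_len = 16000  # one second at 16 kHz, matching the dataset clips
if waveform.shape[1] >= target_len:
    waveform = waveform[:, :target_len]  # trim to one second
else:
    # zero-pad at the end up to one second
    waveform = torch.nn.functional.pad(waveform, (0, target_len - waveform.shape[1]))
torchaudio.save("/content/yes_fixed.wav", waveform, sample_rate)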
I'm using Microsoft Cognitive Services speech-to-text python API for transcription.
Right now, I'm capturing sound through a web API (using the microphone part here: https://ricardodeazambuja.com/deep_learning/2019/03/09/audio_and_video_google_colab/), writing the sound to 'sound.wav', and then sending 'sound.wav' to the MCS STT engine for transcription. The web API gives me a numpy array together with the sample rate of the sound.
My question is: is it possible to send the numpy array and the sample rate directly to MCS STT instead of writing a WAV file?
Here is my code:
import azure.cognitiveservices.speech as speechsdk
import scipy.io.wavfile
audio, sr = get_audio()
p = 'sound.wav'
scipy.io.wavfile.write(p,sr,audio)
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_input = speechsdk.AudioConfig(filename=p)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)
Based on my research and a look through the code:
You will not be able to use the microphone directly in Google Colab, because you have little access to the instance where the Python code executes. Hence you made use of the article, which records the audio at the web-browser level.
Now, the recorded audio is in the WebM format. As per the code, it is then converted to WAV format using FFMPEG.
However, please note that the result contains the WAV header in addition to the audio data.
This is not what the snippet returns: instead of returning audio, sr from get_audio(), you will have to return riff; this is the WAV audio in bytes (including the header in addition to the audio data).
I came across a post which explains the composition of a WAV file at the byte level (it can be related to this output):
http://soundfile.sapp.org/doc/WaveFormat/
From this you will have to strip out the audio data bytes, the samples per second, and the other necessary fields, and use the PushAudioInputStream method.
SAMPLE
channels = 1
bitsPerSample = 16
samplesPerSecond = 16000
audioFormat = speechsdk.audio.AudioStreamFormat(samples_per_second=samplesPerSecond, bits_per_sample=bitsPerSample, channels=channels)
custom_push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audioFormat)
You can then write the audio data to custom_push_stream to run speech-to-text:
custom_push_stream.write(audiodata)
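To complete the picture, here is a hedged sketch of wiring such a push stream into a recognizer and feeding it the numpy array from get_audio(). The float-to-int16 conversion assumes the array holds floats in [-1, 1], which may not match your capture code; speech_config is the one defined in the question.
import numpy as np
import azure.cognitiveservices.speech as speechsdk
audio, sr = get_audio()  # from the question; assumed: numpy array plus sample rate
# Assumption: samples are floats in [-1, 1]; scale them to 16-bit PCM bytes.
pcm_bytes = (np.asarray(audio) * np.iinfo(np.int16).max).astype(np.int16).tobytes()
audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=sr, bits_per_sample=16, channels=1)
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
push_stream.write(pcm_bytes)
push_stream.close()  # signal end of stream so recognition can finish
print(recognizer.recognize_once().text)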
I have generated a .wav audio file containing some speech with some other interference speech in the background.
This code worked for me for a test .wav file:
import speech_recognition as sr
r = sr.Recognizer()
with sr.WavFile(wav_path) as source:
audio = r.record(source)
text = r.recognize_google(audio)
If I use my .wav file, I get the following error:
ValueError: Audio file could not be read as PCM WAV, AIFF/AIFF-C, or Native FLAC; check if file is corrupted or in another format
The situation slightly improves if I save this .wav file with soundfile:
import soundfile as sf
wav, samplerate = sf.read(wav_path)
sf.write(saved_wav_path, wav, samplerate)
and then load the new saved_wav_path back into the first block of code; this time I get:
if not isinstance(actual_result, dict) or len(actual_result.get("alternative", [])) == 0: raise UnknownValueError()
The audio files were saved as
wavfile.write(wav_path, fs, data)
where wav_path = 'data.wav'. Any ideas?
SOLUTION:
Saving the audio data the following way generates the correct .wav files:
import wavio
wavio.write(wav_path, data, fs, sampwidth=2)
From a brief look at the code in the speech_recognition package, it appears that it uses wave from the Python standard library to read WAV files. Python's wave library does not handle floating point WAV files, so you'll have to ensure that you use speech_recognition with files that were saved in an integer format.
SciPy's function scipy.io.wavfile.write will create an integer file if you pass it an array of integers. So if data is a floating point numpy array, you could try this:
from scipy.io import wavfile
import numpy as np
# Convert `data` to 32-bit integers:
y = (np.iinfo(np.int32).max * (data / np.abs(data).max())).astype(np.int32)
wavfile.write(wav_path, fs, y)
Then try to read that file with speech_recognition.
Alternatively, you could use wavio (a small library that I created) to save your data to a WAV file. It also uses Python's wave library to create its output, so speech_recognition should be able to read the files that it creates.
I couldn't figure out from its documentation what the sampwidth for wavio should be; however, adding the line sounddevice.default.dtype = 'int32', 'int32' allowed sounddevice, scipy.io.wavfile.write / soundfile, and speech_recognition to finally work together. The default dtype for sounddevice was float32 for both input and output. I tried changing only the output, but it didn't work. Weirdly, Audacity still thinks the output files are in float32. I am not suggesting this is a better solution, but it did work with both soundfile and scipy.
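For reference, a minimal sketch of that setting with sounddevice (the duration and sample rate here are illustrative):
import sounddevice as sd
# Record as 32-bit integers instead of the float32 default.
sd.default.dtype = 'int32', 'int32'
recording = sd.rec(int(1 * 16000), samplerate=16000, channels=1)  # one second
sd.wait()  # block until the recording is finished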
I also noticed another oddity. When sounddevice.default.dtype was left at the default [float32, float32], I opened the resulting file in Audacity, exported it from there, and the exported WAV would work with speech_recognition. Audacity says its export is float32 at the same sample rate, so I don't fully understand why. I am a noob, but I looked at both files in a hex editor: they look the same for the first 64 hex values and then differ, so the headers seem to be the same. Both look very different from the file I made using int32 output, so there seems to be another factor at play.
Similar to Warren's answer, I was able to resolve this issue by rewriting the WAV file using pydub:
from pydub import AudioSegment
filename = "payload.wav" # File that already exists.
sound = AudioSegment.from_file(filename)  # from_file autodetects the container format
sound.export(filename, format="wav")
I have the following code to load a .wav file and play it:
import base64
import winsound
with open('file.wav','rb') as f:
data = base64.b64encode(f.read())
winsound.PlaySound(base64.b64decode(data), winsound.SND_MEMORY)
It plays the file with no problem, but now I would like to extract a 'chunk', let's say from 233 to 300, and play that portion only.
seg = data[233:300]
winsound.PlaySound(base64.b64decode(seg), winsound.SND_MEMORY)
I get: TypeError: 'sound' must be str or None, not 'bytes'
PlaySound() is expecting a fully-formed WAV file, not a segment of PCM audio data. From the docs for PlaySound(), the underlying function called by Python's winsound:
The SND_MEMORY flag indicates that the lpszSoundName parameter is a pointer to an in-memory image of the WAVE file.
(emphasis added)
Rather than playing around with the internals of WAV files (though that isn't too hard, if you're interested), I'd suggest a more flexible audio library like pygame's. There, you could use the music module and its set_pos function, or use the Sound class to access the raw audio data and cut it as you proposed in your question.
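That said, if you want to stay with winsound, you can build a valid in-memory WAV for just the slice using the standard wave module. A rough sketch; the frame positions below are illustrative examples, not the byte offsets from the question:
import io
import wave
import winsound
with wave.open('file.wav', 'rb') as src:
    params = src.getparams()
    src.setpos(2000)               # example start frame
    frames = src.readframes(8000)  # example number of frames to play
# Rebuild a complete WAV (header + data) in memory for the slice.
buf = io.BytesIO()
with wave.open(buf, 'wb') as dst:
    dst.setparams(params)
    dst.writeframes(frames)
winsound.PlaySound(buf.getvalue(), winsound.SND_MEMORY)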
I've been writing a script using MoviePy. So far I've been able to import videos, clip them, add text, replace the audio and write a new file. It's been a great learning experience. My question is this:
The movie that I'm editing has audio attached. I'd like to be able to import an audio track and add it to the movie without replacing the original audio. In other words, I'd like to mix the new audio file with the audio that's attached to the video so both can be heard.
Does anyone know how to do this?
Thanks in advance!
I wrote my own version, but then I found this here:
new_audioclip = CompositeAudioClip([videoclip.audio, audioclip])
videoclip.audio = new_audioclip
So, create a CompositeAudioClip with the audio of the video clip and the new audio clip, then set the old videoclip's audio to the composite audio track.
Full working code:
from moviepy.editor import *
videoclip = VideoFileClip("filename.mp4")
audioclip = AudioFileClip("audioname.mp3")
new_audioclip = CompositeAudioClip([videoclip.audio, audioclip])
videoclip.audio = new_audioclip
videoclip.write_videofile("new_filename.mp4")
If you want to change an individual audioclip's volume, refer to audio.fx.volumex.
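For instance, a quick sketch that mixes the new track at half volume (volumex is attached to clips by moviepy.editor; the file names are reused from the answer above):
from moviepy.editor import VideoFileClip, AudioFileClip, CompositeAudioClip
videoclip = VideoFileClip("filename.mp4")
# Lower the new track to 50% so the original audio stays audible.
audioclip = AudioFileClip("audioname.mp3").volumex(0.5)
videoclip.audio = CompositeAudioClip([videoclip.audio, audioclip])
videoclip.write_videofile("new_filename.mp4")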