I want to do real-time audio classification using Python. I have a trained deep learning model, but I can only give it a wav file as input. I want to make it real-time, using the microphone as the input. Is that possible? If so, how would I do it?
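One possible approach (not from the thread): record short clips from the microphone with the sounddevice library and push them through the same feature extraction your wav pipeline uses. A minimal sketch, assuming a Keras model saved as model.h5 and librosa-style mel features; the path, sample rate, and clip length are placeholders:

# Record short microphone clips and classify them with an existing Keras model.
import numpy as np
import sounddevice as sd
import librosa
from tensorflow.keras.models import load_model

SAMPLE_RATE = 22050      # must match the rate used during training
CLIP_SECONDS = 2         # length of each chunk fed to the model

model = load_model("model.h5")   # hypothetical path to your trained model

def preprocess(audio, sr):
    # Reuse whatever features the model was trained on, e.g. a mean mel spectrogram.
    mels = np.mean(librosa.feature.melspectrogram(y=audio, sr=sr).T, axis=0)
    return mels[np.newaxis, ...]     # add a batch dimension

while True:
    # Blocking capture of one clip from the default microphone.
    audio = sd.rec(int(CLIP_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    prediction = model.predict(preprocess(audio.flatten(), SAMPLE_RATE), verbose=0)
    print("Predicted class:", np.argmax(prediction))

For lower latency you could switch to sounddevice's callback-based InputStream instead of blocking sd.rec calls.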
So my team and I are making an audio classification app for Android. We used a Python backend for classification and connected it to the Flutter app, but now we want to get rid of that and do it all in Flutter with TFLite. The problem is, we relied on librosa for our data preprocessing (getting a mel spectrogram), and we can't find any libraries that compute mel spectrograms in Flutter. Does anyone here know of one, or can you recommend another way to preprocess audio input for our TFLite model?
Basically, is there a flutter library that can do this:
mels = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T, axis=0)
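One possibility (an assumption on my part, not something confirmed in the thread): fold the mel-spectrogram computation into the TensorFlow graph itself with tf.signal ops, so it gets converted along with the model and no Flutter-side DSP library is needed. The output will not match librosa's defaults bit-for-bit (different mel scaling and windowing), so the model would need to be retrained or validated on these features. A rough sketch:

# Mel spectrogram computed with TensorFlow ops instead of librosa.
import tensorflow as tf

def mel_spectrogram(waveform, sample_rate=22050, num_mel_bins=128):
    # waveform: float32 tensor of shape [num_samples]
    stft = tf.signal.stft(waveform, frame_length=2048, frame_step=512)
    power = tf.abs(stft) ** 2
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=stft.shape[-1],
        sample_rate=sample_rate)
    mels = tf.matmul(power, mel_matrix)     # shape [frames, num_mel_bins]
    return tf.reduce_mean(mels, axis=0)     # mirrors np.mean(....T, axis=0)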
I am trying to create a text-to-speech model using Keras, and I have my own dataset which has text and only one audio feature.
I want to know which audio feature is best for text-to-speech. It is my FYP and I am a little confused. Currently I am using the mel spectrogram, and when I reconstruct it, the result has a bit of noise and a muffled sound.
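Some context on that noise (my reading, not from the thread): a mel spectrogram discards phase, so librosa reconstructs it with Griffin-Lim, which only approximates the original audio and introduces audible artefacts. A small round-trip sketch, with a placeholder file name:

# Mel spectrogram round trip: extract, then invert with Griffin-Lim.
import librosa

y, sr = librosa.load("sample.wav", sr=22050)              # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr)          # power mel spectrogram
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)  # Griffin-Lim inverse

# y_hat approximates y, but the phase and the frequencies collapsed into each
# mel bin cannot be recovered exactly, hence the noisy reconstruction.

A neural vocoder trained on the same features usually gives a much cleaner reconstruction than Griffin-Lim.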
I have successfully built a model for handwritten digits. How would I load the model and use it with live data coming from a video camera? I would like it to draw a box around the number and label it.
Your question is very broad, but there might be one video that answers all your questions.
This is how to use an ML model on your Android device.
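If you would rather prototype in Python first, a rough OpenCV sketch of that camera loop might look like the following; the model path, threshold, and size filter are all placeholder values, and it assumes an MNIST-style Keras model with 28x28 grayscale input:

# Webcam loop: find dark blobs, classify each crop, draw a labelled box.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("digits.h5")     # hypothetical path to your digit model
cap = cv2.VideoCapture(0)           # default camera

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 90, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w < 20 or h < 20:        # skip tiny noise blobs
            continue
        digit = cv2.resize(thresh[y:y + h, x:x + w], (28, 28))
        digit = digit.astype("float32") / 255.0
        pred = model.predict(digit.reshape(1, 28, 28, 1), verbose=0)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, str(np.argmax(pred)), (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("digits", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()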
I am trying to train a CNN with the Common Voice dataset. I am new to speech recognition and cannot find any resources on how to use the dataset with Keras. I followed this article to build a simple word classification network, but I want to scale it up with the Common Voice dataset. Any help is appreciated.
Thank you
What you can do is look at MFCCs. In short, these are features extracted from the audio waveform using signal processing techniques that approximate the way humans perceive sound. In Python, you can use python-speech-features to compute MFCCs.
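A minimal sketch of that feature extraction; the wav path is a placeholder for one of your clips (Common Voice ships mp3s, so convert to wav first, e.g. with ffmpeg):

# Compute 13 MFCCs per frame with python-speech-features.
from python_speech_features import mfcc
from scipy.io import wavfile

rate, signal = wavfile.read("clip.wav")      # hypothetical file
features = mfcc(signal, samplerate=rate, numcep=13)
print(features.shape)                        # (num_frames, 13)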
Once you have prepared your data, you can build a CNN; for example something like this one:
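The example referred to above is not shown here; as a rough stand-in, a small Keras CNN over MFCC "images" of shape (frames, coefficients, 1) might look like this, with all sizes being illustrative:

# Small CNN over MFCC feature "images"; every size here is a placeholder.
from tensorflow.keras import layers, models

num_frames, num_coeffs, num_classes = 99, 13, 10    # illustrative values

model = models.Sequential([
    layers.Input(shape=(num_frames, num_coeffs, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()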
You can also use RNNs (LSTM or GRU for example), but this is a bit more advanced.
EDIT: A very good dataset to start with, if you want:
Speech Commands Dataset
I have implemented a convolutional neural network via transfer learning with VGG19 to classify 5 different traffic signs. It works well on new test images, but when I apply the model to a video stream it doesn't classify them correctly.
Assuming the neural network works well on images, it should work just the same on the frames of a video stream. After all, a video stream is simply a sequence of images.
The problem is not that it doesn't work on a video stream; it simply does not work on images similar to the ones in your video stream.
It is hard to pinpoint the exact problem, since the question does not give enough detail. However, some considerations are:
Obviously, there is a problem with the network's ability to generalize. Was the testing performed well? For example, is there a train-validation split of the data?
Does the training error and the validation error indicate any possible issues, such as overfitting?
Is the data used to train the model similar enough to the video frames?
Is the training dataset large enough? Data augmentation might help if there is not enough data; a rough sketch follows below.
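For the augmentation point, a rough Keras sketch; the directory path and parameter values are illustrative, and the generator also gives you a train/validation split:

# Augment traffic-sign images with random shifts, rotations and brightness changes.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=10,             # small rotations
    width_shift_range=0.1,         # horizontal shifts
    height_shift_range=0.1,        # vertical shifts
    zoom_range=0.1,
    brightness_range=(0.7, 1.3),   # video frames often differ in lighting
    validation_split=0.2)          # hold out part of the data for validation

train_gen = datagen.flow_from_directory(
    "signs/",                      # hypothetical folder, one subfolder per class
    target_size=(224, 224),        # VGG19 input size
    batch_size=32,
    subset="training")
val_gen = datagen.flow_from_directory(
    "signs/", target_size=(224, 224), batch_size=32, subset="validation")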