Transferring Data Using Audio in Python

I have a project that involves taking a string and converting it into a sequence of sounds of different frequencies, and then reading the sound back into the original text.
Machine 1:
"Hello World" --> Some Audio
Machine 2:
Some Audio --> "Hello World"
Are there any libraries or projects out there that will allow me to do this? If not, any suggestions on how to accomplish this?

You will need to have a look at modulation techniques. The normal procedure is this:
Making binary data redundant with some error correction code
Modulation of the data to a discrete signal
D/A converter
Transfer over physical medium
Sampling with an A/D converter
Demodulation
Error correction
If you want to keep things simpler, you can skip the error correction part, but this carries the risk that your whole data is corrupted by even a slightly non-optimal environment.
Let's have a quick look at the software parts of this.
Adding error correction codes
There are many codes to do this. A very simple one is just repeating every bit multiple times and, in the error correction phase, taking a majority vote (the rounded average) over all received copies of a bit.
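A toy sketch of that idea in Python (the factor of three is just an assumption for the example; real systems use stronger codes):
def encode(bits, n=3):
    # repeat every bit n times
    return [b for b in bits for _ in range(n)]

def decode(bits, n=3):
    # majority vote (the rounded average) over each group of n received bits
    return [int(sum(bits[i:i + n]) * 2 >= n) for i in range(0, len(bits), n)]

# encode([1, 0, 1])                   -> [1, 1, 1, 0, 0, 0, 1, 1, 1]
# decode([1, 0, 1, 0, 0, 0, 1, 1, 0]) -> [1, 0, 1]  (one flipped bit per group is corrected)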
Modulation
You have a sequence of ones and zeros and want to convert it to a wave pattern. You do this by mapping them to different base signals. In a simple case those signals can just be sine waves of different frequencies; in general they can be any signals, but they should be orthogonal to be statistically independent.
Then you need to specify how long one bit will be sent; this is called the symbol length. The longer you send the signal for a bit, the easier it is to detect, but the less data you can send per unit of time. Keep in mind that we are creating a discrete signal, which then goes through some D/A converter (our sound card).
An example
We want to send the pattern 00110100 using a sine of 5000 Hz for a 0 and 10000 Hz for a 1. We choose our symbol length to be 1 ms, so it is a multiple of the period of both our base signals, which improves the shape.
So we send a sine of frequency 5000 Hz for 2 ms, then 10000 Hz for 2 ms, then 5000 Hz for 1 ms, 10000 Hz for 1 ms and finally 5000 Hz for 2 ms.
To create the sampling points for this we have to choose an audio format. Let's use 44 kHz sampling frequency.
The code to do this looks something like the following (the 2 * pi factor converts the frequency in Hz into radians):
from math import sin, pi

for bit in data:
    for i in range(int(sampling_frequency * symbol_length)):
        # sample number i corresponds to the time i * sample_length
        signal.append(sin(2 * pi * symbol_frequency(bit) * i * sample_length))
sampling_frequency would be something like 44 kHz, symbol_length is 1 ms, sample_length is 1/sampling_frequency, and symbol_frequency is 5000 Hz for a 0 and 10000 Hz for a 1.
Demodulation
This can be done with a correlation function. Basically, you assume a symbol was sent and then check how similar the received signal is to the signal that symbol would generate. The similarity is the sum over all samples of the product of the received sample and the theoretical sample. If the frequencies match, the signs are equal throughout the signal, so this sums to a big value; for different frequencies the signs change at different points and everything ends up somewhere around zero. For our simple case you can calculate the correlation with an assumed one and an assumed zero and then take the larger of the two as your received symbol.
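A rough sketch of that decision for the two-frequency example above, assuming the received samples have already been sliced into one chunk per symbol; correlating against both a sine and a cosine and taking the magnitude makes the result independent of the received phase:
import numpy as np

def demodulate_symbol(chunk, sampling_frequency=44000, f0=5000, f1=10000):
    # correlate the received chunk with the ideal signals for a 0 and for a 1
    t = np.arange(len(chunk)) / sampling_frequency
    def correlation(f):
        return np.hypot(np.dot(chunk, np.sin(2 * np.pi * f * t)),
                        np.dot(chunk, np.cos(2 * np.pi * f * t)))
    return 0 if correlation(f0) > correlation(f1) else 1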
To read and write your created audio to a file you can use the standard Python wave library: https://docs.python.org/2/library/wave.html
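For instance, a minimal sketch (assuming signal is the list of float samples in the range -1..1 built above, and reusing the 44 kHz rate from the example) that writes it out as a 16-bit mono WAV:
import wave
import struct

sampling_frequency = 44000
out = wave.open("modulated.wav", "wb")
out.setnchannels(1)                   # mono
out.setsampwidth(2)                   # 16-bit samples
out.setframerate(sampling_frequency)
# scale the -1..1 float samples to signed 16-bit integers
out.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in signal))
out.close()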

Related

Sound activated recording and advanced filtering on Raspberry Pi

I'm making a Raspberry Pi bat detector using a USB-powered ultrasonic microphone. I want to be able to record bats while excluding insects and other non-bat noises. Recording needs to be sound-triggered to avoid filling the SD card too quickly and to aid with analysis. This website explains how to do this with SoX:
rec - c1 -r 192000 record.wav sinc 10k silence 1 0.1 1% trim 0 5
This records for 5 seconds after a trigger sound of at least 0.1 seconds and includes a 10 kHz high-pass filter. This is a good start, but what I'd really like is an advanced filter that excludes crickets and other non-bat noises. Insect and bat calls overlap in frequency, so a high-pass or band-pass filter won't do.
The Elekon Batlogger does this with a period trigger that analyses zero crossings. From the Batlogger website:
The difference in sound production of bats (vocal cords) and insects
(stridulation) affects the period continuity. The period trigger takes
advantage of this:
The trigger fires when ProdVal and DivVal are lower than the set limits, i.e. if the values are within the yellow range.
(The values given are the default values):
ProdVal = 8, higher values trigger more easily
DivVal = 20, higher values trigger more easily
Translated text from the image:
Bat: Tonal signal
Period constant => zero crossings / time = stable
Insects: scratching
Period not constant => zero crossings / time = differs
MN => mean value of the number of periods per measurement interval
SD => standard deviation of the number of periods
Higher values trigger more readily, even at low frequencies (including insects!), and vice versa.
Is there a way to implement this (or something to the same effect) in Raspberry Pi OS? The language I'm most familiar with is R. Based on answers to this question it seems like R would be suitable for this problem, although if R isn't the best choice then I'm open to other suggestions.
I'd really appreciate some working code for recording audio and filtering as described above. My desired output is 5 second files that contain bat calls, not insects or noise. Needs to be efficient in terms of CPU / power use and needs to work on-the-fly.
Example recordings of bats and insects here.
UPDATE:
I've got a basic sound-activated script working in Python (based on this answer) but I'm not sure how to include an advanced filter in this:
import pyaudio
import wave
from array import array
import time

FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5

audio = pyaudio.PyAudio()
stream = audio.open(format=FORMAT, channels=CHANNELS,
                    rate=RATE,
                    input=True,
                    frames_per_buffer=CHUNK)

nighttime = True  # I will expand this later

while nighttime:
    data = stream.read(CHUNK)
    data_chunk = array('h', data)
    vol = max(data_chunk)
    if vol >= 3000:
        print("recording triggered")
        frames = []
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
        print("recording saved")
        # write to file
        words = ["RECORDING-", time.strftime("%Y%m%d-%H%M%S"), ".wav"]
        FILE_NAME = "".join(words)
        wavfile = wave.open(FILE_NAME, 'wb')
        wavfile.setnchannels(CHANNELS)
        wavfile.setsampwidth(audio.get_sample_size(FORMAT))
        wavfile.setframerate(RATE)
        wavfile.writeframes(b''.join(frames))
        wavfile.close()
    # check if still nighttime
    nighttime = True  # I will expand this later

stream.stop_stream()
stream.close()
audio.terminate()
TL;DR
R should be capable of doing this in post-processing. If you want this done on a live audio recording/stream, I'd suggest you look for a different tool.
Longer Answer
R is capable of processing audio files through several packages (the most notable seems to be tuneR), but I'm fairly sure that will be limited to post-collection processing, i.e. analysing the files you have already collected, rather than 'live' filtering of streaming audio input.
There are a couple of approaches you could take to 'live' filtering of the insect/unwanted sounds. One would be to just record files like you have listed above, and then write R code to process them (you could automate this on a schedule with cron for example) and discard parts or files that don't match your criteria. If you are concerned about SD card space you can also offload these files to another location after processing (i.e. upload to another drive somewhere). You could make this a fairly short time frame (at the risk of CPU usage on the Pi), to get an 'almost-live' processing approach.
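As a concrete illustration of that post-processing route, here is a rough sketch, in Python rather than R since the recorder in the question is already Python, of a check based on the zero-crossing period idea quoted above; the interval length and stability threshold are made-up values you would need to tune against real recordings, and files that fail the check could simply be deleted:
import wave
import numpy as np

def batlike(path, interval_ms=10, max_cv=0.2):
    # count zero crossings per short interval and call the file bat-like if
    # those counts are stable over time (tonal signal with a constant period)
    with wave.open(path, 'rb') as w:
        rate = w.getframerate()
        samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    step = int(rate * interval_ms / 1000)
    counts = []
    for start in range(0, len(samples) - step, step):
        signs = np.signbit(samples[start:start + step]).astype(np.int8)
        counts.append(int(np.count_nonzero(np.diff(signs))))
    counts = np.array(counts, dtype=float)
    if len(counts) == 0 or counts.mean() == 0:
        return False
    return counts.std() / counts.mean() < max_cv  # low variation -> bat-like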
Another approach would be to look more at the sox documentation and see if there are options in there to achieve what you want based on streaming audio input, or see if there is another tool that you can stream input to, that will allow that sort of filtering.
Good luck!

Python: Compare two audio files which may have noise

For a project, I am recording audio clips (wave files) from different areas near a stage. I need to check whether the source audio, i.e. the audio from the stage, is highly audible at the locations near the stage, using the audio recorded at those places.
More clearly: I have microphones at places near a stage, and I have audio clips from the stage and from these nearby places. How can I check whether the sound from the stage reaches a nearby location, or determine whether the sound from the stage is causing a disturbance at the nearby places?
Sounds like an interesting project. To give a nuts-and-bolts approach (since your question could tap into vast fields like perception and convolutional neural networks): first make sure your audio files are aligned in time. Then feed a window of audio samples (say 2^12, that is 4096, or more, always a power of 2) into an FFT call (Discrete Fourier Transform), which will give you an array of frequency bins, each with a magnitude (ignore phase). Compare this FFT array between your stage mic and each of the surrounding mic files, then slide the window of samples forward in time and repeat until you have visited the full set of samples. You may also want to try various widths for this sampling window.
Also try various ways of comparing the FFT arrays between a pair of mic signals. The frequency bins with the greatest magnitudes should be given greater weight in the comparison, since you want to avoid letting noise in low-magnitude frequency bins muddy the waters; do this by squaring the bin magnitudes to accentuate the dominant frequencies and attenuate the quieter ones. For simplicity, start with a sine curve as your audio signal (search for a mobile app: Frequency Sound Generator); you will get a simpler FFT array, and the goal is just that the one frequency in your source audio shows up in the FFT output analysis.
To do the above, the only library call you really need is the DFT. However, if you don't have the luxury of time to roll your own implementation of this approach, these Python repos may speed up your project:
Librosa - Python library for audio and music analysis
https://librosa.github.io/
https://github.com/librosa/librosa
Madmom - Python audio and music signal processing library
https://madmom.readthedocs.io/en/latest/modules/audio/cepstrogram.html?highlight=mfcc
https://madmom.readthedocs.io
https://github.com/CPJKU/madmom
However, I suggest you avoid the above libs and just roll your own - YMMV.
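Here is a rough roll-your-own sketch of the windowed comparison described above, assuming the two recordings are already time-aligned, mono, at the same sample rate, and loaded as numpy arrays; the window size and the normalised dot-product similarity are arbitrary choices you would tune:
import numpy as np

def spectral_similarity(stage, nearby, window=4096, hop=2048):
    # slide a window over both (time-aligned) signals, compare squared FFT
    # magnitudes per window, and return one similarity score per window
    scores = []
    for start in range(0, min(len(stage), len(nearby)) - window, hop):
        a = np.abs(np.fft.rfft(stage[start:start + window])) ** 2   # squaring emphasises
        b = np.abs(np.fft.rfft(nearby[start:start + window])) ** 2  # the dominant bins
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        scores.append(float(np.dot(a, b) / denom) if denom else 0.0)
    return np.array(scores)
High scores over many windows suggest the stage audio is clearly audible at the nearby microphone; consistently low scores suggest it is not.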

Accurately mixing two notes over each other

I have a large library of many pre-recorded music notes (some ~1200), which are all of consistent amplitude.
I'm researching methods of layering two notes over each other so that it sounds like a chord where both notes are played at the same time.
Samples with different attack times:
As you can see, these samples have different peak amplitude points, which need to line up in order to sound like a human played chord.
Manually aligned attack points:
The 2nd image shows the attack points manually aligned by ear, but this is an unfeasible method for such a large data set, where I wish to create many permutations of chord samples.
I'm considering a method whereby I identify the time of peak amplitude of two audio samples, and then align those two peak amplitude times when mixing the notes to create the chord. But I am unsure of how to go about such an implementation.
I'm thinking of using a Python mixing solution such as the one found here: Mixing two audio files together with python, with some tweaking to mix audio samples over each other.
I'm looking for ideas on how I can identify the times of peak amplitude in my audio samples, or if you have any thoughts on other ways this idea could be implemented I'd be very interested.
In case anyone is actually interested in this question, I have found a solution to my problem. It's a little convoluted, but it has yielded excellent results.
To find the time of peak amplitude of a sample, I found this thread: Finding the 'volume' of a .wav at a given time, where the top answer provided links to a Scala library called AudioFile, which provides a method to find the peak amplitude by going through a sample in frame buffer windows. However, this library required all files to be in .aiff format, so a second library of samples was created, consisting of all the old .wav samples converted to .aiff.
After reducing the frame buffer window, I was able to determine in which frame the highest amplitude was found. Dividing this frame index by the sample rate of the audio samples (which was known to be 48000), I was able to accurately find the time of peak amplitude. This information was used to create a file which stored both the name of the sample file and its time of peak amplitude.
Once this was accomplished, a Python script was written using the Pydub library http://pydub.com/ which would pair up two samples and find the difference (t) in their times of peak amplitude. The sample with the lower time of peak amplitude would have silence of length (t) prepended to it from a .wav containing only silence.
These two samples were then overlaid onto each other to produce the accurately mixed chord!
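For anyone who wants to skip the Scala/.aiff detour, here is a rough sketch of the same idea directly in Python with Pydub and numpy, assuming mono .wav samples and that aligning on the single loudest sample is good enough:
import numpy as np
from pydub import AudioSegment

def peak_time_ms(seg):
    # time of the sample with the largest absolute amplitude, in milliseconds
    samples = np.array(seg.get_array_of_samples())
    return 1000.0 * int(np.argmax(np.abs(samples))) / seg.frame_rate

def mix_chord(path_a, path_b):
    a, b = AudioSegment.from_wav(path_a), AudioSegment.from_wav(path_b)
    ta, tb = peak_time_ms(a), peak_time_ms(b)
    # prepend silence to the note whose peak comes earlier so both peaks line up
    if ta < tb:
        a = AudioSegment.silent(duration=tb - ta, frame_rate=a.frame_rate) + a
    else:
        b = AudioSegment.silent(duration=ta - tb, frame_rate=b.frame_rate) + b
    return a.overlay(b)

# mix_chord("C4.wav", "E4.wav").export("C4_E4_chord.wav", format="wav")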

Understanding the output of a DCT

I have some trouble understanding the output of the Discrete Cosine Transform.
Background:
I want to achieve a simple audio compression by saving only the most relevant frequencies of a DCT. In order to be somewhat general, I would cut several audio tracks into pieces of a fixed size, say 5 seconds.
Then I would do a DCT on each sample and find out which are the most important frequencies among all short snippets.
This however does not work, which might be due to my misunderstanding of the DCT. See for example the images below:
The first image shows the DCT of the first 40 seconds of an audio track (wanted to make it long enough so that I get a good mix of frequencies).
The second image shows the DCT of the first ten seconds.
The third image shows the DCT of a reversed concatenation (like abc -> abccba) of the first 40 seconds.
I added a vertical mark at 2e5 for comparison. The sample rate of the music is the usual 44.1 kHz.
So here are my questions:
What is the frequency that corresponds to an individual value of the DCT-output-vector? Is it bin/2? Like if I have a spike at bin=10000, which frequency in the real world does this correspond to?
Why does the first plot show strong amplitudes for so many more frequencies than the second? My intuition was that the DCT would yield values for all frequencies up to 44.1 kHz (so bin number 88.2k if my assumption in #1 is correct), only that the scale of the spikes would be different, which would then make up the difference in the music.
Why does the third plot show strong amplitudes for more frequencies than the first does? I thought that by concatenating the data, I would not get any new frequencies.
As the DCT and the FFT/DFT are very similar, I tried to learn more about the FT (this and this helped), but apparently it didn't suffice.
Figured it out myself, and it was indeed written in the link I posted in the question. The frequency that corresponds to a certain bin_id is given by (bin_id * freq/2) / (N/2), which essentially boils down to bin_id * 1/t, with N = freq * t. This means that the plots just have different granularities. So if plot#1 has a high point at position x, plot#2 will likely show a high point at x/4 and plot#3 at x*2.
The image below shows the data of plot#1 stretched to twice its size (in blue) and the data of plot#3 in yellow.
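A small sketch (assuming numpy and scipy are available) that illustrates the granularity point empirically: the same pure tone analysed over different durations peaks at a bin position that scales with the duration, so the plots differ only in their x-axis scale:
import numpy as np
from scipy.fftpack import dct

fs = 44100   # sample rate in Hz
f = 1000     # test tone frequency in Hz

for t in (10, 40):  # analysis lengths in seconds
    n = np.arange(fs * t)
    tone = np.cos(2 * np.pi * f * n / fs)
    spectrum = np.abs(dct(tone, type=2))
    print(f"duration {t:>2} s -> peak at bin {int(np.argmax(spectrum))}")
# the 40 s peak bin is 4x the 10 s peak bin: same frequency, finer granularity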

Is there a fast way to find (not necessarily recognize) human speech in an audio file?

I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to somehow algorithmically find human speech and adjust the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) work with servers (which is not very convenient for me) and (probably) do a lot of unnecessary work determining what exactly has been said, while I only need to know that something has been said.
In other words, I want to give it the audio file and get something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in python) that only finds human speech and runs on a local machine?
The technical term for what you are trying to do is called Voice Activity Detection (VAD). There is a python library called SPEAR that does it (among other things).
webrtcvad is a Python wrapper around Google's excellent WebRTC Voice Activity Detection (VAD) implementation--it does the best job of any VAD I've used as far as correctly classifying human speech, even with noisy audio.
To use it for your purpose, you would do something like this:
Convert the file to either 8 kHz or 16 kHz, 16-bit, mono format. This is required by the WebRTC code.
Create a VAD object: vad = webrtcvad.Vad()
Split the audio into 30 millisecond chunks.
Check each chunk to see if it contains speech: vad.is_speech(chunk, sample_rate)
The VAD output may be "noisy", and if it classifies a single 30 millisecond chunk of audio as speech you don't really want to output a time for that. You probably want to look over the past 0.3 seconds (or so) of audio and see if the majority of 30 millisecond chunks in that period are classified as speech. If they are, then you output the start time of that 0.3 second period as the beginning of speech. Then you do something similar to detect when the speech ends: Wait for a 0.3 second period of audio where the majority of 30 millisecond chunks are not classified as speech by the VAD--when that happens, output the end time as the end of speech.
You may have to tweak the timing a little bit to get good results for your purposes--maybe you decide that you need 0.2 seconds of audio where more than 30% of chunks are classified as speech by the VAD before you trigger, and 1.0 seconds of audio with more than 50% of chunks classified as non-speech before you de-trigger.
A ring buffer (collections.deque in Python) is a helpful data structure for keeping track of the last N chunks of audio and their classification.
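A minimal sketch of steps 1-4, assuming the file has already been converted to 16 kHz, 16-bit, mono PCM (the filename is a placeholder, and the smoothing/ring-buffer logic described above is left out):
import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3
with wave.open("speech_16k_mono.wav", "rb") as w:
    sample_rate = w.getframerate()             # 8000 or 16000 as described above
    frame_bytes = int(sample_rate * 0.03) * 2  # 30 ms of 16-bit samples
    audio = w.readframes(w.getnframes())

for i, start in enumerate(range(0, len(audio) - frame_bytes + 1, frame_bytes)):
    if vad.is_speech(audio[start:start + frame_bytes], sample_rate):
        print("speech in the 30 ms frame starting at %.2f s" % (i * 0.03))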
You could run a window across your audio file and try to extract what fraction of the total signal power is in the human vocal range (fundamental frequencies lie between 50 and 300 Hz). The following is to give intuition and is untested on real audio.
import scipy.fftpack as sf
import numpy as np

def hasHumanVoice(X, threshold, F_sample, Low_cutoff=50, High_cutoff=300):
    """ Searching presence of frequencies on a real signal using FFT
    Inputs
    =======
    X: 1-D numpy array, the real time domain audio signal (single channel time series)
    Low_cutoff: float, frequency components below this frequency will not pass the filter (physical frequency in unit of Hz)
    High_cutoff: float, frequency components above this frequency will not pass the filter (physical frequency in unit of Hz)
    F_sample: float, the sampling frequency of the signal (physical frequency in unit of Hz)
    threshold: Has to be standardized once to say how much power must be there in real vocal signal frequencies.
    """
    M = X.size  # let M be the length of the time series
    Spectrum = np.abs(sf.rfft(X, n=M))  # use the magnitude of the spectrum
    [Low_cutoff, High_cutoff, F_sample] = map(float, [Low_cutoff, High_cutoff, F_sample])
    # Convert cutoff frequencies into points on spectrum
    [Low_point, High_point] = map(lambda F: int(F / F_sample * M), [Low_cutoff, High_cutoff])
    totalPower = np.sum(Spectrum)
    # Calculating fraction of power in these frequencies
    fractionPowerInSignal = np.sum(Spectrum[Low_point:High_point]) / totalPower
    if fractionPowerInSignal > threshold:
        return 1
    else:
        return 0

voiceVector = []
for window in fullAudio:  # Run a window of appropriate length across the audio file
    voiceVector.append(hasHumanVoice(window, threshold, samplingRate))
