firstly, this function is to remove silence of an audio.
here is the official description:
https://librosa.github.io/librosa/generated/librosa.effects.split.html
librosa.effects.split(y, top_db=10, *kargs)
Split an audio signal into non-silent intervals.
top_db:number > 0 The threshold (in decibels) below reference to consider as silence
return: intervals:np.ndarray, shape=(m, 2) intervals[i] == (start_i, end_i) are the start and end time (in samples) of non-silent interval i.
so this is quite straightforward, for any sound which is lower than 10dB, treat it as silence and remove from the audio. It will return me a list of intervals which are non-silent segments in the audio.
So I did a very simple example and the result confuses me:
the audio i load here is a 3 second humand talking, very normal takling.
y, sr = librosa.load(file_list[0]) #load the data
print(y.shape) -> (87495,)
intervals = librosa.effects.split(y, top_db=100)
intervals -> array([[0, 87495]])
#if i change 100 to 10
intervals = librosa.effects.split(y, top_db=10)
intervals -> array([[19456, 23040],
[27136, 31232],
[55296, 58880],
[64512, 67072]])
how is this possible...
I tell librosa, ok, for any sound which is below 100dB, treat it as silence.
under this setting, the whole audio should be treated as silence, and based on the document, it should give me array[[0,0]] something...because after remove silence, there is nothing left...
But it seems librosa returns me the silence part instead of the non-silence part.
librosa.effects.split()
It says in the documentation that it returns a numpy array that contains the intervals which contain non silent audio. These intervals of course depend on the value you assign to the parameter top_db. It does not return any audio, just the start and end points of the non-silent slices of your waveform
In your case, even if you set top_db = 100, it does not treat the entire audio as silence since they state in the documentation that they use The reference power. By default, it uses **np.max** and compares to the peak power in the signal. So setting your top_db higher than the maximal value that exists in your audio will actually result in top_db not having any effect.
Here's an example:
import librosa
import numpy as np
import matplotlib.pyplot as plt
# create a hypothetical waveform with 1000 noisy samples and 1000 silent samples
nonsilent = np.random.randint(11, 100, 1000) * 100
silent = np.zeros(1000)
wave = np.concatenate((nonsilent, silent))
# look at it
print(wave.shape)
plt.plot(wave)
plt.show()
# get the noisy interval
non_silent_interval = librosa.effects.split(wave, top_db=0.1, hop_length=1000)
print(non_silent_interval)
# plot only the noisy chunk of the waveform
plt.plot(wave[non_silent_interval[0][0]:non_silent_interval[0][1]])
plt.show()
# now set top_db higher than anything in your audio
non_silent_interval = librosa.effects.split(wave, top_db=1000, hop_length=1000)
print(non_silent_interval)
# and you'll get the entire audio again
plt.plot(wave[non_silent_interval[0][0]:non_silent_interval[0][1]])
plt.show()
You can see that non silent audio is from 0 to 1000 and the silent audio is from 1000 to 2000 samples:
Here it only gives us the noisy chunk of the wave we created:
And here is with top_db set at a 1000:
That means librosa did everything that it promised to do in the documentation. Hope this helps.
Related
I asked a similar question earlier, but I made the question more complex than it had to be. I am generating a 100 hz sine wave, that I then playback using simpleaudio.
Note: I had this problem when I encoded the wave to a .wav file. Sounded exactly the same as with simple audio. Also changing channels from 2 to 1 changes the sound, but does not fix this problem.
To install simple audio:
sudo apt-get install -y python3-dev libasound2-dev
python -m pip install simpleaudio
Stand alone code:
import numpy as np
import simpleaudio as sa
import matplotlib.pyplot as plt
def generate_sine_tone(numsamples, sample_time, frequency):
t = np.arange(numsamples) * sample_time # Time vector
signal = 8388605*np.sin(2*np.pi * frequency*t)
return signal
if __name__ == "__main__":
duration = 1
samprate = 44100 # Sampling rate
numsamples = samprate*duration# Sample count
st = 1.0 / samprate # Sample time
t = np.arange(numsamples) * st # Time vecto
nchannels = 2
sampwidth = 3
signal = generate_sine_tone(numsamples, st, 100)
signal2 = np.asarray([ int(x) for x in signal ])
play_obj = sa.play_buffer(signal2, nchannels, sampwidth, samprate)
print(signal2)
plt.figure(0)
plt.plot(signal2)
plt.show()
Running this in the command line will produce a graph of the sine wave for 1 second, or 44100, samples, which is 100 periods of the sine wave. It will also play the sound into your speakers, so turn your system sound down a good bit before running.
My other posts on this issue: Trying to generate a sine wave '.wav' file in Python. Comes out as a square wave
https://music.stackexchange.com/questions/110688/generated-sine-wave-in-python-comes-out-buzzy-or-square-ey
expected sound: https://www.youtube.com/watch?v=eDk1bOX-P3w&t=4s
received sound (approx): https://www.youtube.com/watch?v=F7DnVBJ9R34
This problem is annoying me sooo much, I would greatly appreciate any help that can be provided.
There are two problems here.
The lesser one is that you are creating a single array and playing it back as if it were stereo. You need to set nchannels = 1 (or duplicate all the values by creating an array with two columns).
The other problem is trying to create 24-bit samples. Very few people have good enough equipment and good enough ears to tell the difference between 24-bit and 16-bit audio. Using a sample width of 2 makes things much easier. You can generate 24-bit samples if you wish and normalize them to 16-bit for playback: signal *= 32767 / np.max(np.abs(signal))
This code works
import numpy as np
import simpleaudio as sa
def generate_sine_tone(numsamples, sample_time, frequency):
t = np.arange(numsamples) * sample_time # Time vector
signal = 32767*np.sin(2*np.pi * frequency*t)
return signal
duration = 1
samprate = 44100 # Sampling rate
numsamples = samprate*duration# Sample count
st = 1.0 / samprate # Sample time
nchannels = 1
sampwidth = 2
signal = generate_sine_tone(numsamples, st, 100)
signal2 = signal.astype(np.int16)
#signal2 = np.asarray([ int(x) for x in signal ])
play_obj = sa.play_buffer(signal2, nchannels, sampwidth, samprate)
play_obj.wait_done()
The simpleaudio.play_buffer() function does not convert your data. It only takes the exact memory buffer (i.e. the buffer it gets from the object you gave) and interprets it as what you claim it to contain. In your program your description of what the buffer contains (2 * 3 byte items) is not what it actually contains (1 * 8 byte items). Unfortunately in your example program this does not result in an error, because the size of the buffer you gave it coincidentally happens to be an exact multiple of 6, the size in bytes you claim your memory buffer's items to have. If you try it with one more sample, numsamples = 44101, you will get an error, because 44101 * 8 is not divisible by 6:
ValueError: Buffer size (in bytes) is not a multiple of bytes-per-sample and the number of channels.
Try what print(signal2.itemsize) shows. It's not the 3 * 2 that you claim it to be in your call to simpleaudio.play_buffer(). If the following is still correct, there's no way to get 24 bit buffers from Numpy even if you tried to: NumPy: 3-byte, 6-byte types (aka uint24, uint48)
And perhaps that's why the tutorial tells you to just use 16-bit data type for Numpy buffers, see https://github.com/hamiltron/py-simple-audio/blob/master/docs/tutorial.rst
Numpy arrays can be used to store audio but there are a few crucial
requirements. If they are to store stereo audio, the array must have
two columns since each column contains one channel of audio data. They
must also have a signed 16-bit integer dtype and the sample amplitude
values must consequently fall in the range of -32768 to 32767.
What are these "buffers"? They are a way for Python objects to pass low-level raw byte data between each other and libraries written in e.g. C. See this: https://docs.python.org/3/c-api/buffer.html or this: https://jakevdp.github.io/blog/2014/05/05/introduction-to-the-python-buffer-protocol/
If you want to create 24 bit buffers from your audio data, then you'll have to use some other library or low-level byte-by-byte hacking for creating the memory buffer, because Numpy won't do it for you. But you might be able to use dtype=numpy.float32 to get 32-bit floats that have 4-byte samples per channel. Simpleaudio detects this from the sample size, for example for Alsa:
https://github.com/hamiltron/py-simple-audio/blob/master/c_src/simpleaudio_alsa.c
/* set that format appropriately */
if (bytes_per_chan == 1) {
sample_format = SND_PCM_FORMAT_U8;
} else if (bytes_per_chan == 2) {
sample_format = SND_PCM_FORMAT_S16_LE;
} else if (bytes_per_chan == 3) {
sample_format = SND_PCM_FORMAT_S24_3LE;
} else if (bytes_per_chan == 4) {
sample_format = SND_PCM_FORMAT_FLOAT_LE;
} else {
ALSA_EXCEPTION("Unsupported Sample Format.", 0, "", err_msg_buf);
return NULL;
}
That's a little bit like using the weight of a vehicle for determining if it's a car, a motorcycle or a bicycle. It works, but it might feel odd to only be asked about the weight of a vehicle and not at all about its type.
So. To fix your program, use the dtype parameter of asarray() to convert your data to the buffer format you want, and declare the correct format in play_buffer(). And perhaps remove the scaling factor 8388605 from the sine generation, replace it with whatever you actually want and place it somewhere near the format specification.
My main task is to recognize a human humming from a microphone in real time. As the first step to recognizing signals in general, I have made a 5 seconds recording of a 440 Hz signal generated from an app on my phone and tried to detect the same frequency.
I used Audacity to plot and verify the spectrum from the same 440Hz wav file and I got this, which shows that 440Hz is indeed the dominant frequency :
(https://i.imgur.com/2UImEkR.png)
To do this with python, I use the PyAudio library and refer this blog. The code I have so far which I run with the wav file is this :
"""PyAudio Example: Play a WAVE file."""
import pyaudio
import wave
import sys
import struct
import numpy as np
import matplotlib.pyplot as plt
CHUNK = 1024
if len(sys.argv) < 2:
print("Plays a wave file.\n\nUsage: %s filename.wav" % sys.argv[0])
sys.exit(-1)
wf = wave.open(sys.argv[1], 'rb')
p = pyaudio.PyAudio()
stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
channels=wf.getnchannels(),
rate=wf.getframerate(),
output=True)
data = wf.readframes(CHUNK)
i = 0
while data != '':
i += 1
data_unpacked = struct.unpack('{n}h'.format(n= len(data)/2 ), data)
data_np = np.array(data_unpacked)
data_fft = np.fft.fft(data_np)
data_freq = np.abs(data_fft)/len(data_fft) # Dividing by length to normalize the amplitude as per https://www.mathworks.com/matlabcentral/answers/162846-amplitude-of-signal-after-fft-operation
print("Chunk: {} max_freq: {}".format(i,np.argmax(data_freq)))
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(data_freq)
ax.set_xscale('log')
plt.show()
stream.write(data)
data = wf.readframes(CHUNK)
stream.stop_stream()
stream.close()
p.terminate()
In the output, I get that the max frequency is 10 for all the chunks and an example of one of the plots is :
(https://i.imgur.com/zsAXME5.png)
I had expected this value to be 440 instead of 10 for all the chunks. I admit I know very little about the theory of FFTs and I appreciate any help in letting my solve this.
EDIT:
The sampling rate is 44100. no. of channels is 2 and sample width is also 2.
Forewords
As xdurch0 pointed out, you are reading a kind of index instead of a frequency. If you are about to make all computation by yourself you need to compute you own frequency vector before plotting if you want to get consistent result. Reading this answer may help you towards the solution.
The frequency vector for FFT (half plane) is:
f = np.linspace(0, rate/2, N_fft/2)
Or (full plane):
f = np.linspace(-rate/2, rate/2, N_fft)
On the other hand we can delegate most of the work to the excellent scipy.signal toolbox which aims to cope with this kind of problems (and many more).
MCVE
Using scipy package it is straight forward to get the desired result for a simple WAV file with a single frequency (source):
import numpy as np
from scipy import signal
from scipy.io import wavfile
import matplotlib.pyplot as plt
# Read the file (rate and data):
rate, data = wavfile.read('tone.wav') # See source
# Compute PSD:
f, P = signal.periodogram(data, rate) # Frequencies and PSD
# Display PSD:
fig, axe = plt.subplots()
axe.semilogy(f, P)
axe.set_xlim([0,500])
axe.set_ylim([1e-8, 1e10])
axe.set_xlabel(r'Frequency, $\nu$ $[\mathrm{Hz}]$')
axe.set_ylabel(r'PSD, $P$ $[\mathrm{AU^2Hz}^{-1}]$')
axe.set_title('Periodogram')
axe.grid(which='both')
Basically:
Read the wav file and get the sample rate (here 44.1kHz);
Compute the Power Spectrum Density and frequencies;
Then display it with matplotlib.
This outputs:
Find Peak
Then we can find the frequency of the first highest peak (P>1e-2, this criterion is subject to tuning) using find_peaks:
idx = signal.find_peaks(P, height=1e-2)[0][0]
f[idx] # 440.0 Hz
Putting all together it merely boils down to:
def freq(filename, setup={'height': 1e-2}):
rate, data = wavfile.read(filename)
f, P = signal.periodogram(data, rate)
return f[signal.find_peaks(P, **setup)[0][0]]
Handling multiple channels
I tried this code with my wav file, and got the error for the line
axe.semilogy(f, Pxx_den) as follows : ValueError: x and y must have
same first dimension. I checked the shapes and f has (2,) while
Pxx_den has (220160,2). Also, the Pxx_den array seems to have all
zeros only.
Wav file can hold multiple channels, mainly there are mono or stereo files (max. 2**16 - 1 channels). The problem you underlined occurs because of multiple channels file (stereo sample).
rate, data = wavfile.read('aaaah.wav') # Shape: (46447, 2), Rate: 48 kHz
It is not well documented, but the method signal.periodogram also performs on matrix and its input is not directly consistent with wavfile.read output (they perform on different axis by default). So we need to carefully orient dimensions (using axis switch) when performing PSD:
f, P = signal.periodogram(data, rate, axis=0, detrend='linear')
It also works with Transposition data.T but then we need to back transpose the result.
Specifying the axis solve the issue: frequency vector is correct and PSD is not null everywhere (before it performed on the axis=1 which is of length 2, in your case it performed 220160 PSD on 2-samples signals we wanted the converse).
The detrend switch ensure the signal has zero mean and its linear trend is removed.
Real application
This approach should work for real chunked samples, provided chunks hold enough data (see Nyquist-Shannon sampling theorem). Then data are sub-samples of the signal (chunks) and rate is kept constant since it does not change during the process.
Having chunks of size 2**10 seems to work, we can identify specific frequencies from them:
f, P = signal.periodogram(data[:2**10,:], rate, axis=0, detrend='linear') # Shapes: (513,) (513, 2)
idx0 = signal.find_peaks(P[:,0], threshold=0.01, distance=50)[0] # Peaks: [46.875, 2625., 13312.5, 16921.875] Hz
fig, axe = plt.subplots(2, 1, sharex=True, sharey=True)
axe[0].loglog(f, P[:,0])
axe[0].loglog(f[idx0], P[idx0,0], '.')
# [...]
At this point, the trickiest part is the fine tuning of find-peaks method to catch desired frequencies. You may need to consider to pre-filter your signal or post-process the PSD in order to make the identification easier.
I have a code here that displays the waveform of an audio file in PyQT Graph, unfortunately the graph seems so big.
I can't attached an image yet so I'll provide a link of the screenshot of the graph that I made.
And here is my code:
self.waveFile = wave.open(audio,'rb')
self.format = pyaudio.paInt16
channel = self.waveFile.getnchannels()
self.rate = self.waveFile.getframerate()
self.frame = self.waveFile.getnframes()
self.stream = p.open(format=self.format,
channels=channel,
rate=self.rate,
output=True)
durationF = self.frame / float(self.rate)
self.data_int = self.waveFile.readframes(self.frame)
self.data_plot = np.fromstring(self.data_int, 'Int16')
self.data_plot.shape = -1, 2
self.data_plot = self.data_plot.T
self.time = np.arange(0, self.frame) * (1.0 / self.rate)
w = pg.plot()
w.plot(self.time, self.data_plot[0])
Should I need to adjust X and Y range limits? Should I adjust the Y peak? As you can see the X(time) matches from the audio file that I used with 8 seconds duration. But the Y isn't(?). I am not sure how to adjust the data of the waveform so that it can be fit inside the window. Any response and suggestions will be of great help!
I think there are a couple of options depending on what you want to show.
1: Adjust the Y-limit
The simplest solution is to scale the Y axis.
# See docs for function setYrange
# setYRange(min, max, padding=None, update=True)
w.setRange(YRange=[min,max])
You can check the docs here.
That is if you want to keep all of the audio values the same as they are currently, although do you really need the audio data in terms of those values? Normally audio data is displayed as a float between -1 and +1 for scientific purposes at least.
2: Adjust your data
As said before audio data tends to be most useful when its scaled between -1 and +1; it's just easier for us to glance at it and instantly get a feeling for if the amplitude is correct (if we were testing a gain program for example). There are plenty of other Python libraries other than the built in wave module, which will handle this much easier for you like PySoundFile or many others (see this other SO post for other methods of reading .wav files in Python).
Otherwise you can convert the data received from the wave module to floating point data using something like this (props to yeeking for the code):
import wave
import struct
import sys
def wav_to_floats(wave_file):
w = wave.open(wave_file)
astr = w.readframes(w.getnframes())
# convert binary chunks to short
a = struct.unpack("%ih" % (w.getnframes()* w.getnchannels()), astr)
a = [float(val) / pow(2, 15) for val in a]
return a
# read the wav file specified as first command line arg
signal = wav_to_floats(sys.argv[1])
print "read "+str(len(signal))+" frames"
print "in the range "+str(min(signal))+" to "+str(min(signal))
If possible using a library is always better in this case, because the wave module as it stands doesn't support many audio use cases (as far as I was aware, only mono 16 bit audio).
Note: If you do convert it to -1 to +1 data probably still worthwhile adjusting the Y-Limit like explained in part 1. Just to avoid weird scaling when loading different .wav files.
I have a Audio Signal and have imported using wave.open function.
then I am converting the signal into frames and then using a window to take a set number of samples and checking their amplitude to see if it falls within a set threshold. If it does then I consider it as a Silence and if not then and Audio Signal.
import wave
import struct
import numpy as np
import matplotlib.pyplot as plt
sound_file = wave.open('Audio_1.wav', 'r')
file_length = sound_file.getnframes()
sound = np.zeros(file_length)
for i in range(file_length):
data = sound_file.readframes(1)
data = struct.unpack("<h", data)
sound[i] = int(data[0])
sound = np.divide(sound, float(2**15)) # Normalized data range [-1 to 1]
#print sound #vector corresponding to audio signal containing audio samples
Ap = np.pad(sound, (0,int(np.ceil(len(sound) / 2205.)) * 2205 - len(sound)), 'constant', constant_values=0) # Padding of the sound samples so that the input is a multiple of 2205(window Length)
Apr = Ap.reshape((len(Ap) // 2205, 2205))
Apr.shape
array1=(Apr ** 2).sum(axis=1) #Record Sum of Squares of the amplitude of the signal falling within that window
print array1
#print len(array1)
threshold =1103.4
result= np.array(filter(lambda x: x>= threshold, array1)) #filtering elements below set threshold
print result
print len(result)
print np.where(array1>1103.4) # finding starting index of the elements.
Below are my doubts:
How to find the ending index of the window? So that I can specifically slice that window out from the input.
How should I proceed so that I can get back the samples which contain the audio signal and convert those signals into frequency domain using np.fft.fft().
If any statement or question unclear. Please Specify.
Thank you
How do you remove "popping" and "clicking" sounds in audio constructed by concatenating sound tonal sound clips together?
I have this PyAudio code for generating a series of tones:
import time
import math
import pyaudio
class Beeper(object):
def __init__(self, **kwargs):
self.bitrate = kwargs.pop('bitrate', 16000)
self.channels = kwargs.pop('channels', 1)
self._p = pyaudio.PyAudio()
self.stream = self._p.open(
format = self._p.get_format_from_width(1),
channels = self.channels,
rate = self.bitrate,
output = True,
)
self._queue = []
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.stream.stop_stream()
self.stream.close()
def tone(self, frequency, length=1000, play=False, **kwargs):
number_of_frames = int(self.bitrate * length/1000.)
##TODO:fix pops?
g = get_generator()
for x in xrange(number_of_frames):
self._queue.append(chr(int(math.sin(x/((self.bitrate/float(frequency))/math.pi))*127+128)))
def play(self):
sound = ''.join(self._queue)
self.stream.write(sound)
time.sleep(0.1)
with Beeper(bitrate=88000, channels=2) as beeper:
i = 0
for f in xrange(1000, 800-1, int(round(-25/2.))):
i += 1
length = log(i+1) * 250/2./2.
beeper.tone(frequency=f, length=length)
beeper.play()
but when the tones changes, there's a distinctive "pop" in the audio, and I'm not sure how to remove it.
At first, I thought the pop was occurring because I was immediately playing each clip, and the time between each playback when I generate the clip was enough of a delay to cause the audio to flatline. However, when I concatenated all the clips into a single string and played that, the pop was still there.
Then, I thought the sine-waves weren't matching at the boundaries for each clip, so I tried to average the first N frames of the current audio clip with the last N frames of the previous clip, but that also had no effect.
What am I doing wrong? How do I fix this?
The answer you've written for yourself will do the trick but isn't really the correct way to do this type of thing.
One of the problems is your checking for the "tip" or peak of the sine wave by comparing against 1. Not all sine frequencies will hit that value or may require a large number of cycles to do so.
Mathematically speaking, the peak of the sine is at sin(pi/2 + 2piK) for all integer values of K.
To compute sine for a given frequency you use the formula y = sin(2pi * x * f0/fs) where x is the sample number, f0 is the sine frequency and fs is the sample rate.
For a nice number like 1kHz at 48kHz sample rate, when x=12 then:
sin(2pi * 12 * 1000/48000) = sin(2pi * 12/48) = sin(pi/2) = 1
However at a frequency like 997Hz then the true peak falls a fraction of a sample after sample 12.
sin(2pi * 12 * 997/48000) = 0.99087178042
sin(2pi * 12 * 997/48000) = 0.99998889671
sin(2pi * 12 * 997/48000) = 0.99209828673
A better method of stitching the waveforms together is to keep track of the phase from one tone and use that as the starting phase for the next.
First, for a given frequency you need to figure out the phase increment, notice it is the same as what you are doing with the sample factored out:
phInc = 2*pi*f0/fs
Next, compute the sine and update a variable representing the current phase.
for x in xrange(number_of_frames):
y = math.sin(self._phase);
self._phase += phaseInc;
Putting it all together:
def tone(self, frequency, length=1000, play=False, **kwargs):
number_of_frames = int(self.bitrate * length/1000.)
phInc = 2*math.pi*frequency/self.bitrate
for x in xrange(number_of_frames):
y = math.sin(self._phase)
_phase += phaseInc;
self._queue.append(chr(int(y)))
My initial suspicion that the individual waveforms weren't aligning was correct, which I confirmed by inspecting in Audacity. My solution was to modify the code to start and stop each waveform on the peak of the sine wave.
def tone(self, frequency, length=1000, play=False, **kwargs):
number_of_frames = int(self.bitrate * length/1000.)
record = False
x = 0
y = 0
while 1:
x += 1
v = math.sin(x/((self.bitrate/float(frequency))/math.pi))
# Find where the sin tip starts.
if round(v, 3) == +1:
record = True
if record:
self._queue.append(chr(int(v*127+128)))
y += 1
if y > number_of_frames and round(v, 3) == +1:
# Always end on the high tip of the sin wave to clips align.
break
If you are concatenating clips of varying attributes, you may hear clicking sound if peaks of two clips at the points of concatenation does not align.
One way to get around this is to do Fade-out at the end of first signal and then fade-in at the beginning of second signal. then continue this pattern through rest of the concatenation process. Check here for details on Fading.
I would try out concatenation in visual tools like Audacity , try Fade-out and fade-in on clips you want to join and play around with timing and settings to get desired results.
Next, I am not sure pyAudio has any easy way of implementation fading, however, if you can , you may want to try pyDub. It provides easy ways to manipulate audio. It has both Fade-in and Fade-out methods as well as cross-fade method, which basically performs both fade in and out in one step.
You can install pydub as pip install pydub
Here is a sample code for pyDub:
from pydub import AudioSegment
from pydub.playback import play
#Load first audio segment
audio1 = AudioSegment.from_wav("SineWave_440Hz.wav")
#Load second audio segment
audio2 = AudioSegment.from_wav("SineWave_150Hz.wav")
# 1.5 second crossfade
combinedAudio= audio1.append(audio2, crossfade=1500)
#Play combined Audio
play(combinedAudio)
Finally, if you really want to get noise / pops cleared at a professional grade, you may want to look at PSOLA (Pitch Synchronous Overlap and Add) .
Here one would convert audio signals to frequency domain and then perform PSOLA on chunks to merge the audio with minimum possible noise.
That was long, but hope it helps.