Understanding the output of a DCT - python

I have some trouble understanding the output of the Discrete Cosine Transform.
Background:
I want to achieve a simple audio compression by saving only the most relevant frequencies of a DCT. In order to be somewhat general, I would cut several audio tracks into pieces of a fixed size, say 5 seconds.
Then I would do a DCT on each sample and find out which are the most important frequencies among all short snippets.
This however does not work, which might be due to my misunderstanding of the DCT. See for example the images below:
The first image shows the DCT of the first 40 seconds of an audio track (wanted to make it long enough so that I get a good mix of frequencies).
The second image shows the DCT of the first ten seconds.
The third image shows the DCT of a reverse concatenation (like abc -> abccba) of the first 40 seconds.
I added a vertical mark at 2e5 for comparison. The sample rate of the music is the usual 44.1 kHz.
So here are my questions:
What is the frequency that corresponds to an individual value of the DCT-output-vector? Is it bin/2? Like if I have a spike at bin=10000, which frequency in the real world does this correspond to?
Why does the first plot show strong amplitudes for so many more frequencies than the second? My intuition was that the DCT would yield values for all frequencies up to 44.1 kHz (so bin number 88.2k if my assumption in #1 is correct), only that the scale of the spikes would be different, which would then make up the difference in the music.
Why does the third plot show strong amplitudes for more frequencies than the first does? I thought that by concatenating the data, I would not get any new frequencies.
As the DCT and FFT/DFT are very similar, I tried to learn more about the Fourier transform (this and this helped), but apparently it didn't suffice.

Figured it out myself, and it was indeed written in the link I posted in the question. The frequency that corresponds to a certain bin_id is given by (bin_id * freq/2) / (N/2), which essentially boils down to bin_id * 1/t with N = freq * t. This means that the plots just have different granularities: if plot #1 has a high point at position x, plot #2 will likely show a high point at x/4 and plot #3 at x*2.
The image below shows the data of plot #1 stretched to twice its size (in blue) and the data of plot #3 (in yellow).
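To make the granularity point concrete, here is a small numerical check. It is only a sketch: it assumes scipy.fft.dct and a made-up 440 Hz test tone, neither of which appears in the original post, but it shows the peak bin of the same tone landing four times higher for a 40-second snippet than for a 10-second one.
import numpy as np
from scipy.fft import dct

fs = 44100                # sample rate, as in the question
f_tone = 440.0            # hypothetical test tone
for t in (10, 40):        # snippet lengths, as in plot #2 and plot #1
    n = np.arange(fs * t)
    x = np.cos(2 * np.pi * f_tone * n / fs)
    spectrum = np.abs(dct(x, norm='ortho'))
    print(t, "s ->", np.argmax(spectrum))   # the 40 s peak bin is ~4x the 10 s peak bin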

Related

How to get complete fundamental (f0) frequency extraction with python lib librosa.pyin?

I am running librosa.pyin on a speech audio clip, and it doesn't seem to be extracting all the fundamentals (f0) from the first part of the recording.
librosa documentation: https://librosa.org/doc/main/generated/librosa.pyin.html
import librosa

# Assumed load step (not shown in the original snippet); sr: 22050
y, sr = librosa.load("quick_fox.wav", sr=22050)
fmin = librosa.note_to_hz('C0')
fmax = librosa.note_to_hz('C7')
f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                             fmin=fmin,
                                             fmax=fmax,
                                             pad_mode='constant',
                                             n_thresholds=10,
                                             max_transition_rate=100,
                                             sr=sr)
Raw audio:
Spectrogram with fundamental tones, onsets, and onset strength, but the first part doesn't have any fundamental tones extracted.
link to audio file: https://jasonmhead.com/wp-content/uploads/2022/12/quick_fox.wav
o_env = librosa.onset.onset_strength(y=y, sr=sr)  # onset strength envelope (assumed; not shown in the original snippet)
times = librosa.times_like(o_env, sr=sr)
onset_frames = librosa.onset.onset_detect(onset_envelope=o_env, sr=sr)
Another view with power spectrogram:
I tried compressing the audio, but that didn't seem to work.
Any suggestions on what parameters I can adjust, or audio pre-processing that can be done to have fundamental tones extracted from all words?
What type of things affect fundamental tone extraction success?
TL;DR It seems like it's all about parameter tweaking.
Here are some results I got playing with the example; it would be better to open the image in a separate tab:
The bottom plot shows a phonetic transcription (well, kinda) of the example file. Some conclusions I've made to myself:
There are some words/parts of words that are difficult to hear: they have low energy, and listened to in isolation they don't sound like a word, only when coupled with nearby segments ("the" is very short and sounds more like "z").
Some words are divided into parts (e.g. "fo"-"x").
I don't really know what the F0 frequency should be when someone pronounces "x". I'm not even sure there is any difference in pronunciation between people (otherwise how do cats know that we are calling them all over the world).
A two-second period is a pretty short amount of time.
Some experiments:
If we want to see a smooth F0 graph, going with n_thresholds=1 will do the trick. It's a bad idea: in the voiced_flag part of the graphs, we see that for n_thresholds=1 it decides that every frame was voiced, counting every frequency change as activity.
Changing the sample rate affects the ability to retrieve F0 (in the rightmost graph, the sample rate was halved); as mentioned above n_thresholds=1 doesn't count, but we also see that n_thresholds=100 (the default value for pyin) doesn't produce any F0 at all.
The top-left (max_transition_rate=200) and middle (max_transition_rate=100) graphs show the extracted F0 for n_thresholds=2 and n_thresholds=100. It degrades pretty fast: n_thresholds=3 already looks almost the same as n_thresholds=100. I find the lower part, the voiced_flag decision plot, very informative when combined with the phonetic transcript. In the middle graph, the default parameters recognise "qui", "jum", "over", "la". If we want F0 for the other phonemes, n_thresholds=2 should do the work.
Setting n_thresholds=3 or higher gives F0s in the same range. Increasing max_transition_rate adds noise and a reluctance to declare that the voiced segment is over. (A rough sketch of this kind of parameter sweep follows the list.)
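The sketch below is only an illustration of how these runs could be produced; it reuses y, sr, fmin and fmax from the question, keeps the question's max_transition_rate=100, and just measures how many frames end up with an F0 estimate instead of plotting anything:
import numpy as np

for n_thresholds in (1, 2, 3, 100):
    f0, voiced_flag, voiced_probs = librosa.pyin(y,
                                                 fmin=fmin,
                                                 fmax=fmax,
                                                 sr=sr,
                                                 n_thresholds=n_thresholds,
                                                 max_transition_rate=100)
    coverage = np.mean(~np.isnan(f0))  # fraction of frames that got an F0 estimate
    print(n_thresholds, round(float(coverage), 3))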
Those are my thoughts. Hope it helps.

Histogram for csv data

I have a csv file where one of the columns stands for how many followers a Twitter user has:
The csv file has about 1 000 000 rows. I'd like to create a graph showing the distribution of follower counts across the whole data. Since the range of follower counts is quite big (from 0 followers up to hundreds of thousands), the graph can be quite approximate: each bar could represent 1000 followers or even more (so the 1st one would be 0-1000, then 1000-2000, etc.). I hope I'm making myself clear.
I've tried a simple code but it gives a weird result.
import pandas as pd

df = pd.read_csv(".csv", encoding='utf8', delimiter=',')
df["user.followers_count"].hist()
Here's the result:
Does it have anything to do with the size and a large range of my data?
There is a bins argument in the hist function you have called. You just need to update it with a more reasonable value.
To understand: If the range is 1-10000, and you have set bins=10, then 1-1000 is one bin, 1000-2000 is another, and so on.
Increasing the number of bins (and thus reducing this range size) will help you get a smoother distribution curve and get what you are trying to achieve with this code/dataset.
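For example, a minimal sketch along those lines (assuming the column parses as integers and using 1000-follower-wide bins, as suggested in the question):
import numpy as np

counts = df["user.followers_count"]
bin_edges = np.arange(0, counts.max() + 1000, 1000)  # one bar per 1000 followers
counts.hist(bins=bin_edges)
With such a heavy-tailed distribution, a logarithmic y-axis (plt.yscale('log'), with matplotlib.pyplot imported as plt) often makes the tail visible.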
Documentation link:
https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.hist.html
Given your data, I get the following output:

Data analysis to extract period

I'm only asking this question because I've recently stumbled upon some clever code that I would have never thought of on my own. The code I refer to uses the numpy Python library. It converts the signal to a true/false array based on whether the signal is above a threshold. Then it generates an array that lines up with the middle of each bit. Then it reshapes the data into groups of 8. It takes half a dozen lines of code to analyze thousands of data points. I've written code that does similar things, but it walks through the entire dataset using for loops looking for edges and then converts those edges to bits; it takes literally hundreds of lines of code.
Pictured is an example of a dataset I'm trying to analyze. The beginning always has a preamble of 8 bits that are the same. I want to extract what the period of the signal is using the preamble.
Are there any methods for doing so in python without painstakingly looking for edges?
# Find the transitions
edges = np.abs(x[:-1] - x[1:]) > limit
# Find the indices where the transitions occur
indices = np.where(edges)[0]
# Count the elements as the difference between the indices
counts = indices[1:] - indices[:-1]
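A hedged continuation of that snippet: assuming the first few spacings in counts come from the preamble and that one spacing corresponds to one bit (both assumptions depend on the actual waveform), the period can be read off directly.
preamble_counts = counts[:8]              # assumed: the first spacings belong to the preamble
bit_period = np.median(preamble_counts)   # samples per bit, robust to a stray glitch
print("estimated bit period:", bit_period, "samples")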

Accurately mixing two notes over each other

I have a large library of many pre-recorded music notes (some ~1200), which are all of consistent amplitude.
I'm researching methods of layering two notes over each other so that it sounds like a chord where both notes are played at the same time.
Samples with different attack times:
As you can see, these samples have different peak amplitude points, which need to line up in order to sound like a human played chord.
Manually aligned attack points:
The 2nd image shows the attack points manually aligned by ear, but this is an unfeasible method for such a large data set where I wish to create many permutations of chord samples.
I'm considering a method whereby I identify the time of peak amplitude of two audio samples, and then align those two peak amplitude times when mixing the notes to create the chord. But I am unsure of how to go about such an implementation.
I'm thinking of using a Python mixing solution such as the one found here: Mixing two audio files together with python, with some tweaking to mix audio samples over each other.
I'm looking for ideas on how I can identify the times of peak amplitude in my audio samples, or if you have any thoughts on other ways this idea could be implemented I'd be very interested.
In case anyone is actually interested in this question, I have found a solution to my problem. It's a little convoluted, but it has yielded excellent results.
To find the time of peak amplitude of a sample, I found this thread: Finding the 'volume' of a .wav at a given time, where the top answer provided links to a Scala library called AudioFile, which provided a method to find the peak amplitude by going through a sample in frame buffer windows. However, this library required all files to be in .aiff format, so a second library of samples was created consisting of all the old .wav samples converted to .aiff.
After reducing the frame buffer window, I was able to determine in which frame the highest amplitude was found. Dividing this frame by the sample rate of the audio samples (which was known to be 48000), I was able to accurately find the time of peak amplitude. This information was used to create a file which stored the name of each sample file along with its time of peak amplitude.
Once this was accomplished, a Python script was written using the Pydub library http://pydub.com/ which would pair up two samples and find the difference (t) in their times of peak amplitude. The sample with the lower time of peak amplitude would have silence of length (t) prepended to it from a .wav containing only silence.
These two samples were then overlaid onto each other to produce the accurately mixed chord!
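For anyone trying to reproduce this in pure Python, here is a rough sketch of the same idea that swaps the Scala AudioFile step for numpy and does both the padding and the overlay in Pydub. The file names are made up, and the per-sample peak search skips the frame-buffer windowing described above:
import numpy as np
from pydub import AudioSegment

def peak_time_ms(path):
    # Time of the highest-amplitude sample, in milliseconds.
    seg = AudioSegment.from_wav(path)
    samples = np.array(seg.get_array_of_samples())
    return 1000.0 * np.argmax(np.abs(samples)) / (seg.frame_rate * seg.channels)

note_a = AudioSegment.from_wav("note_a.wav")   # hypothetical file names
note_b = AudioSegment.from_wav("note_b.wav")
delay = peak_time_ms("note_a.wav") - peak_time_ms("note_b.wav")
if delay > 0:
    # pad the earlier-peaking note with silence so the peaks line up
    note_b = AudioSegment.silent(duration=delay, frame_rate=note_b.frame_rate) + note_b
else:
    note_a = AudioSegment.silent(duration=-delay, frame_rate=note_a.frame_rate) + note_a
chord = note_a.overlay(note_b)
chord.export("chord.wav", format="wav")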

Extracting the threshold value of some given raw numbers

I am trying to determine the conditions of a wireless channel by analyzing captured I/Q samples. I have 50000 data samples, and as shown in the attached figure, there are spikes in the graph when there is activity (e.g. a data transmission) on the channel. I am trying to count the number of spikes, i.e. data values higher than a threshold.
I need an accurate estimate of the threshold, and then I can find the channel load. The threshold value in the attached figure is around 0.0025, and it should be noted that it varies over time. So, each time I take 50000 samples, I have to find the threshold value first, using some sort of unsupervised learning.
I tried k-means (in Python scikit-learn) to cluster the data and find the centroids of the estimated clusters, but it doesn't give me a good estimate of the threshold value (especially when there is no activity on the channel and the channel is idle).
I would like to know whether anyone has prior experience with similar topics.
Captured data
Since the idle noise seems relatively consistent and very different from when data is transmitted, I can think of several simple algorithms which could give you a reasonable threshold in an unsupervised manner.
The most direct method would be to sort the values (perhaps first group into buckets), then find the lowest-valued region where a large enough proportion (at least ~5%) of values fall. Take a reasonable margin above the highest values (50%?) and you should be good to go.
You'll need to fiddle with the thresholds a bit. I'd collect sample data and tweak the values until I get it working 100% of the time and the values used make sense.
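A minimal sketch of that idea, assuming the 50000 magnitudes sit in a numpy array called samples; the 5% proportion, the 50% margin and the bucket count are exactly the values that need fiddling:
import numpy as np

def estimate_threshold(samples, noise_fraction=0.05, margin=1.5, n_bins=200):
    # Bucket the magnitudes, walk up from the lowest bucket until the
    # low-valued (idle-noise) region holds at least `noise_fraction` of all
    # samples, then place the threshold `margin` times above that region.
    mags = np.abs(samples)
    hist, edges = np.histogram(mags, bins=n_bins)
    cumulative = np.cumsum(hist) / mags.size
    noise_top = edges[np.searchsorted(cumulative, noise_fraction) + 1]
    return margin * noise_top

# threshold = estimate_threshold(samples)
# busy = np.abs(samples) > threshold   # spikes = activity above the threshold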
