How to plot a normal-distribution-like histogram? - python

I have data like [A,A,A,B,B,B,B,B,B,C,C,C,C,D,D,D,...], which I convert into a numerical list like [1,1,1,2,2,2,2,2,2,3,3,3,3,4,4,4,...].
Each element has a frequency; for example, A shows up 3 times.
When I plot a histogram of this, the third element (the character C) shows up most often.
I would like to place the bar for that most frequent element in the center, and next to it the second and third most frequent elements, so that the bars form a normal-distribution-like arrangement.
In short, I would like to see whether the distribution of the data has a normal shape or not.
I already checked this with a QQ plot, but I would also like to see it in a histogram of the actual data.

If I understood your goal correctly, I would recommend the distplot function from seaborn. You will get both the distribution curve and the histogram!
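A minimal sketch of that suggestion, using the numeric version of the data from the question. Note that in recent seaborn versions distplot is deprecated, so this uses its replacement histplot with kde=True:

import seaborn as sns
import matplotlib.pyplot as plt

# Numeric version of the categorical data from the question
data = [1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]

# histplot draws the histogram; kde=True overlays a smoothed density curve;
# discrete=True centers the bars on the integer values
sns.histplot(data, kde=True, discrete=True)
plt.show()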

You have asked several questions in a single post. I will answer the one about plotting the frequency of occurrence. Suppose your list contains strings. You can use the Counter class from the collections module to compute the frequencies, then plot the items and their frequencies directly with plt.bar():
from collections import Counter
import matplotlib.pyplot as plt

lst = ['A','A','A','B','B','B','B','B','B','C','C','C','C','D','D','D','E','E','E','E']
counts = Counter(lst)  # maps each item to its frequency
plt.bar(counts.keys(), counts.values())
plt.show()
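To get the arrangement the question actually asks for, with the most frequent bar in the center and the rest fanning out to either side, one option (my own sketch, not part of the original answer) is to sort the categories by frequency and alternate them around the middle:

from collections import Counter
import matplotlib.pyplot as plt

lst = ['A','A','A','B','B','B','B','B','B','C','C','C','C','D','D','D','E','E','E','E']
counts = Counter(lst)

# Sort categories from most to least frequent
ranked = counts.most_common()

# Alternate the ranked categories around the middle: most frequent in the
# center, the next ones placed to the right, then the left, and so on outward
ordered = []
for item, _ in ranked:
    if len(ordered) % 2 == 0:
        ordered.append(item)     # place on the right side
    else:
        ordered.insert(0, item)  # place on the left side

plt.bar(ordered, [counts[item] for item in ordered])
plt.show()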

Related

Generate random numbers from a distribution given by a list of numbers in python

Let's say I have a list of (float) numbers:
list_numbers = [0.27,0.26,0.64,0.61,0.81,0.83,0.78,0.79,0.05,0.12,0.07,0.06,0.38,0.35,0.04,0.03,0.46,0.01,0.18,0.15,0.36,0.36,0.26,0.26,0.93,0.12,0.31,0.28,1.03,1.03,0.85,0.47,0.77]
In my case, this is obtained from a pandas dataframe column, meaning that they are not bounded between any pair of values a priori.
The idea now is to obtain a new list of randomly-generated numbers, which follow the same distribution, meaning that, for a sufficiently large sample, both lists should have fairly similar histograms.
I tried using np.random.choice, but it does not work for my case: I do not want to draw only values that are already in the original list, but rather new values which may or may not be in it, as long as they follow the same distribution.
As mentioned above, the list is relatively small, so it is indeed hard to decide what the distribution looks like. Even so, the following code might provide a solution to your problem:
import numpy as np
import matplotlib.pyplot as plt

# Original list of numbers
list_numbers = [0.27,0.26,0.64,0.61,0.81,0.83,0.78,0.79,0.05,0.12,0.07,0.06,0.38,0.35,0.04,0.03,0.46,0.01,0.18,0.15,0.36,0.36,0.26,0.26,0.93,0.12,0.31,0.28,1.03,1.03,0.85,0.47,0.77]

# Construct a histogram using 10 bins
counts, bin_edges = np.histogram(list_numbers, bins=10)

# Sample new numbers using the histogram: each draw picks the left edge of a
# bin, with probability proportional to that bin's count
new_numbers = np.random.choice(bin_edges[:-1], size=len(list_numbers), p=counts/len(list_numbers))

# Plot histograms of the original and new numbers; alpha makes the overlap visible
plt.hist(list_numbers, bins=bin_edges, alpha=0.5, label="Original list_numbers")
plt.hist(new_numbers, bins=bin_edges, alpha=0.5, label="New list_numbers")

# Add labels and a legend
plt.xlabel("Value")
plt.ylabel("Count")
plt.legend()

# Show the plot
plt.show()
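One refinement, as my own addition: the np.random.choice approach only ever returns the left bin edges. scipy.stats.rv_histogram builds a continuous distribution from the same histogram, so the samples spread smoothly within the bins:

import numpy as np
from scipy import stats

list_numbers = [0.27,0.26,0.64,0.61,0.81,0.83,0.78,0.79,0.05,0.12,0.07,0.06,0.38,0.35,0.04,0.03,0.46,0.01,0.18,0.15,0.36,0.36,0.26,0.26,0.93,0.12,0.31,0.28,1.03,1.03,0.85,0.47,0.77]

# Build a continuous distribution from the histogram
counts, bin_edges = np.histogram(list_numbers, bins=10)
dist = stats.rv_histogram((counts, bin_edges))

# Draw new values that follow the histogram's shape but are not
# restricted to values present in the original list
new_numbers = dist.rvs(size=1000)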

Plotly Express: How do I add a second colormap to heatmap object?

I created a heatmap where correlations of two entities are visualized. However, as the matrix is symmetric, I added significance values below the diagonal for higher information density. As those values are usually far smaller than those of the correlation coefficient, I want to use a second colormap to differentiate between the upper and lower triangles of the matrix. The code is the following:
fig = px.imshow(
    data,
    labels=dict(x="Correlation of Returns", y="", color="PCC"),
    x=domain,
    y=domain,
    color_continuous_scale=px.colors.diverging.balance,
    zmin=-1, zmax=1,
)
The data object is simply my n x n matrix as a list of lists, and domain holds my label values. The resulting figure already contains one colormap. Is there a way to add a second one and apply it to the values below the diagonal? I haven't found a solution online yet. Thanks in advance!
Note: I am using Dash, so I may need to stick to plotly figures and won't be able to use e.g. matplotlib
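No answer is included here, but a common workaround (a sketch of my own, assuming plotly.graph_objects; the matrix, colorscales, and names are illustrative) is to split the matrix into triangles with NaN masks and overlay two go.Heatmap traces, each with its own colorscale and colorbar. Since the result is still a plotly figure, it works with Dash as well:

import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(0)
n = 5
matrix = rng.uniform(-1, 1, size=(n, n))  # stand-in for the correlation matrix

# Masks: NaN cells are not drawn, so each trace shows only one triangle
upper = np.where(np.triu(np.ones((n, n), dtype=bool)), matrix, np.nan)
lower = np.where(np.tril(np.ones((n, n), dtype=bool), k=-1), matrix, np.nan)

fig = go.Figure()
# Upper triangle: correlation coefficients on a diverging scale
fig.add_trace(go.Heatmap(z=upper, colorscale="RdBu", zmin=-1, zmax=1,
                         colorbar=dict(x=1.0, title="PCC")))
# Lower triangle: significance values on a second, separate scale
fig.add_trace(go.Heatmap(z=lower, colorscale="Viridis",
                         colorbar=dict(x=1.15, title="p-value")))
fig.show()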

Is there a way to plot Matplotlib's Imshow against a specific array rather than the indices?

I'm trying to use imshow to plot a 2D Fourier transform of my data. However, imshow plots the data against its index in the array. I would like to plot the data against a set of arrays I have containing the corresponding frequency values (one array for each dimension), but can't figure out how.
I have a 2D array of data (a Gaussian pulse signal) that I Fourier transform with np.fft.fft2. This all works fine. I then get the corresponding frequency bins for each dimension with np.fft.fftfreq(len(data))*sampling_rate. I can't figure out how to use imshow to plot the data against these frequencies, though. The 1D equivalent of what I'm trying to do is using plt.plot(x,y) rather than just plt.plot(y).
My first attempt was to use imshow's extent argument, but as far as I can tell that just changes the axis limits, not the actual bins.
My next solution was to use np.fft.fftshift to arrange the data in numerical order and then simply re-scale the axis using this answer: Change the axis scale of imshow. However, the index-to-frequency-bin mapping is not a pure scaling factor; there is typically a constant offset as well.
I also tried plt.hist2d instead of imshow, but that doesn't work, since hist2d plots the number of times an ordered pair occurs, while I want to plot a scalar value corresponding to specific ordered pairs (i.e. the power of the signal at specific frequency combinations).
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
f = 200
st = 2500
x = np.linspace(-1,1,2*st)
y = signal.gausspulse(x, fc=f, bw=0.05)
data = np.outer(np.ones(len(y)),y) # A simple example with constant y
Fdata = np.abs(np.fft.fft2(data))**2
freqx = np.fft.fftfreq(len(x))*st # What I want to plot my data against
freqy = np.fft.fftfreq(len(y))*st
plt.imshow(Fdata)
I should see a peak at (200, 0) corresponding to the frequency of my signal (with some fall-off around it corresponding to the bandwidth), but instead my maximum occurs at some position corresponding to the frequency's index in my data array. If anyone has any ideas, fixes, or other functions to use, I would greatly appreciate it!
I cannot run your code, but I think you are looking for the extent= argument to imshow(). See the page on origin and extent for more information.
Something like this may work?
plt.imshow(Fdata, extent=(freqx[0],freqx[-1],freqy[0],freqy[-1]))
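One caveat worth adding: np.fft.fftfreq returns the positive frequencies first and then the negative ones, so both the spectrum and the frequency arrays usually need np.fft.fftshift before extent gives sensible axis labels. A sketch building on the question's Fdata, freqx, and freqy:

import numpy as np
import matplotlib.pyplot as plt

# Reorder both the spectrum and the frequency bins into ascending order
Fdata_shifted = np.fft.fftshift(Fdata)
fx = np.fft.fftshift(freqx)
fy = np.fft.fftshift(freqy)

# extent maps array indices to frequency values; origin='lower' keeps the
# y axis increasing upward
plt.imshow(Fdata_shifted, extent=(fx[0], fx[-1], fy[0], fy[-1]),
           origin='lower', aspect='auto')
plt.xlabel('Frequency x')
plt.ylabel('Frequency y')
plt.show()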

Is there a Python package that can trace a curve with a Gaussian lineshape over several x and y values?

My apologies for my ignorance in advance; I've only been learning Python for about two months. Every example question that I've seen on Stack Overflow seems to discuss a single distribution over a series of data, but not one distribution per data point with band broadening.
I have some (essentially) infinitely thin bars at value x with height y that I need to run a line over so that it looks like the following photo:
The bars are obtained from the table of data on the far right. The curve is what I'm trying to make.
I am doing some TD-DFT work to calculate a theoretical UV/visible spectrum. It will output absorbance strengths (y-values, i.e., heights) for specific wavelengths of light (x-values). Theoretically, these are typically plotted as infinitely-thin bars, though we experimentally obtain a curve instead. The theoretical data can be made to appear like an experimental spectrum by running a curve over it that hugs y=0 and has a Gaussian lineshape around every absorbance bar.
I'm not sure if there's a feature that will do this for me, or if I need to do something like make a loop summing Gaussian curves for every individual absorbance, and then plot the resulting formula.
Thanks for reading!
It looks like my answer was to use Seaborn to do a kernel density estimation. Because a KDE isn't weighted and only considers the density of x-values, I had to write a small loop that builds a new list with each x-entry repeated in proportion to its intensity:
# First, scale the intensities up by 1,000,000 so they are all integers
# (list2 contains the raw intensities):
list3 = [i * 1000000 for i in list2]

# Repeat each x-value in proportion to its intensity
for j in range(len(list1)):  # list1 contains the x-values
    list5.append([list1[j]] * int(list3[j]))  # list5 starts out empty

# Now flatten the list of lists into a single list
for k in range(len(list5)):
    for l in list5[k]:
        list4.append(l)  # list4 is a flat list rather than a list of lists
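The loop-of-Gaussians approach the question itself mentions is also straightforward. This is my own hedged sketch (the peaks data and the sigma value are made up for illustration), summing one Gaussian per absorbance line:

import numpy as np
import matplotlib.pyplot as plt

# (wavelength, intensity) pairs for each calculated absorbance line
peaks = [(250.0, 0.8), (310.0, 0.3), (420.0, 1.0)]
sigma = 10.0  # Gaussian width, in the same units as the x-axis

x = np.linspace(200, 500, 1000)
spectrum = np.zeros_like(x)

# Sum one Gaussian per line, centered at the line's wavelength and
# scaled by its intensity; the sum hugs y=0 away from the lines
for center, height in peaks:
    spectrum += height * np.exp(-((x - center) ** 2) / (2 * sigma ** 2))

plt.plot(x, spectrum)
plt.vlines([c for c, _ in peaks], 0, [h for _, h in peaks])  # the original thin bars
plt.xlabel("Wavelength")
plt.ylabel("Absorbance")
plt.show()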

How can I account for identical data points in a scatter plot?

I'm working with some data that has several identical data points. I would like to visualize the data in a scatter plot, but scatter plotting doesn't do a good job of showing the duplicates.
If I change the alpha value, then the identical data points become darker, which is nice, but not ideal.
Is there some way to map the color of a dot to how many times it occurs in the data set? What about size? How can I assign the size of the dot to how many times it occurs in the data set?
As it was pointed out, whether this makes sense depends a bit on your dataset. If you have reasonably discrete points and exact matches make sense, you can do something like this:
import numpy as np
import matplotlib.pyplot as plt
test_x=[2,3,4,1,2,4,2]
test_y=[1,2,1,3,1,1,1] # I am just generating some test x and y values. Use your data here
#Generate a list of unique points
points=list(set(zip(test_x,test_y)))
#Generate a list of point counts
count=[len([x for x,y in zip(test_x,test_y) if x==p[0] and y==p[1]]) for p in points]
#Now for the plotting:
plot_x=[i[0] for i in points]
plot_y=[i[1] for i in points]
count=np.array(count)
plt.scatter(plot_x,plot_y,c=count,s=100*count**0.5,cmap='Spectral_r')
plt.colorbar()
plt.show()
Note: you will need to adjust the marker scale (the value 100 in the s argument) according to your point density. I also used the square root of the count so that the point area is proportional to the count.
Also note: if you have very dense points, it might be more appropriate to use a different kind of plot. Histograms, for example (I personally like hexbin for 2D data), are a decent alternative in such cases.
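As a more compact variant of the counting step (my sketch, not part of the answer), np.unique with return_counts finds the exact duplicates in one call and avoids the quadratic list comprehension:

import numpy as np
import matplotlib.pyplot as plt

test_x = [2, 3, 4, 1, 2, 4, 2]
test_y = [1, 2, 1, 3, 1, 1, 1]

# Stack the points as rows and count exact duplicate rows in one call
points, count = np.unique(np.column_stack([test_x, test_y]),
                          axis=0, return_counts=True)

plt.scatter(points[:, 0], points[:, 1], c=count, s=100 * count ** 0.5,
            cmap='Spectral_r')
plt.colorbar()
plt.show()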
