Why is hist2d plotting UNIFORM density map? - python

I am plotting a density map of ~40k points, but hist2d returns a uniform density map. This is my code:
hist2d(x, y, bins=(1000, 1000), cmap=plt.cm.jet)
Here is the scatter plot
Here is the histogram
I was expecting a red horizontal band in the center that gradually turns blue towards higher and lower y values.
EDIT:
@bb1 suggested decreasing the number of bins, but setting bins=(100, 1000) gives this result:

I think you are specifying too many bins. By setting bins=(1000, 1000) you get 1,000,000 bins. With 40,000 points, at least 96% of the bins must be empty, and the empty bins overwhelm the image.
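As a rough check of that arithmetic, here is a minimal sketch. The data are synthetic (an assumption, since the asker's x and y are not available), and np.histogram2d is used because it bins the same way as hist2d without drawing anything. The helper name empty_fraction is made up for this illustration.

```python
import numpy as np

# Synthetic stand-in for the asker's ~40k points (assumed shape: a
# horizontal band that is densest around y = 0).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=40_000)
y = rng.normal(0.0, 1.0, size=40_000)

def empty_fraction(n_bins):
    """Fraction of bins with zero counts in an n_bins x n_bins histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=(n_bins, n_bins))
    return float(np.mean(counts == 0))

fine = empty_fraction(1000)   # 1,000,000 bins for 40,000 points
coarse = empty_fraction(50)   # 2,500 bins for the same points

print(f"1000 x 1000 bins: {fine:.1%} empty")
print(f"  50 x   50 bins: {coarse:.1%} empty")
```

With far fewer bins than points, most bins hold several points and the color scale has something to distinguish.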
You may also consider using seaborn's kdeplot() function instead of plt.hist2d(). It visualizes the density of the data without subdividing it into bins:
import seaborn as sns
sns.kdeplot(x=x, y=y, levels = 100, fill=True, cmap="mako", thresh=0)

Related

Why is distplot changing the range of values plotted?

I have an array of values for which I am trying to fit a probability density function. I plotted the histogram using distplot as shown below:
x = [ 17.56,
162.52,
172.58,
160.82,
182.14,
165.86,
242.06,
135.76,
122.86,
230.22,
208.66,
271.36,
122.68,
188.42,
171.82,
102.30,
196.40,
107.38,
192.35,
179.66,
173.30,
254.66,
176.12,
75.365,
135.78,
103.66,
183.50,
166.08,
207.66,
146.22,
151.19,
172.20,
103.41,
133.93,
186.48,]
sns.distplot(x)
and the plot looks like this:
The minimum value in my array is about 17 and the maximum is about 271, so I don't understand the range on the x-axis in the figure, as I have not passed any extra arguments. Does sns.distplot standardize the data before plotting?
A KDE curve is a sum of Gaussian (normal) curves, one centered on each data point. Each of these curves has an infinite tail, so the KDE extends beyond the data range; here it is cut off once it gets close enough to zero height.
Note that sns.distplot has been deprecated since seaborn 0.11 and is replaced (in this case) by sns.histplot(..., kde=True). kdeplot has a parameter cut= that sets how far, in multiples of the bandwidth, the curve extends past the extreme data points; it defaults to 3, and cut=0 truncates the curve at the data limits. In histplot, cut is one of the kde_kws and defaults to zero there, so the call below behaves like sns.histplot(x, kde=True, kde_kws={'cut': 0}).
import seaborn as sns
x = [17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06, 135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30, 254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48]
sns.histplot(x, kde=True)
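To see where the extra x-axis range comes from, here is a minimal NumPy-only sketch of a Gaussian KDE. It is an illustration, not seaborn's implementation: the function name gaussian_kde_curve is hypothetical, and the bandwidth uses Scott's rule, which is assumed here to approximate seaborn's default.

```python
import numpy as np

x = np.array([17.56, 162.52, 172.58, 160.82, 182.14, 165.86, 242.06,
              135.76, 122.86, 230.22, 208.66, 271.36, 122.68, 188.42,
              171.82, 102.30, 196.40, 107.38, 192.35, 179.66, 173.30,
              254.66, 176.12, 75.365, 135.78, 103.66, 183.50, 166.08,
              207.66, 146.22, 151.19, 172.20, 103.41, 133.93, 186.48])

def gaussian_kde_curve(data, cut=3.0, gridsize=200):
    """Sum of Gaussians centered on the data, evaluated on a grid that
    extends cut * bandwidth past the smallest and largest data points."""
    n = data.size
    bw = data.std(ddof=1) * n ** (-1 / 5)   # Scott's rule (assumed default)
    grid = np.linspace(data.min() - cut * bw, data.max() + cut * bw, gridsize)
    z = (grid[:, None] - data[None, :]) / bw
    density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bw * np.sqrt(2 * np.pi))
    return grid, density

grid_wide, dens_wide = gaussian_kde_curve(x, cut=3.0)  # tails included
grid_cut, dens_cut = gaussian_kde_curve(x, cut=0.0)    # clipped at the data

print("data range :", x.min(), x.max())
print("cut=3 range:", grid_wide[0], grid_wide[-1])
print("cut=0 range:", grid_cut[0], grid_cut[-1])
```

The data are not standardized at any point; the wider axis comes only from evaluating the tails of the Gaussians beyond the smallest and largest observations.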

2d histogram: Get result of full nbins x nbins

I am using matplotlib's hist2d function to make a 2d histogram of data that I have, however I am having trouble interpreting the result.
Here is the plot I have:
This was created using the line:
hist = plt.hist2d(X, Y, (160,160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
This returns a 2d array of (160, 160), as well as the bin edges etc.
In the plot there are bins with a high frequency of values (yellow bins). I would like to get the results of this histogram and filter out the bins that have low values, keeping the high ones. I would expect 160*160 values, but I can only find 160 X and 160 Y values.
What I would like to do is essentially filter out the more dense data from the less dense data. If this means representing the data as a single value (a bin), then that is ok.
Am I misinterpreting the function, or am I not accessing the results correctly? I have also tried with scipy, but the results seem to be in the same or a similar format.
Not sure if this is what you wanted.
The hist2d docs specify that the function returns a tuple of size 4, where the first item h is the 2D array of bin counts (the heatmap).
h has one entry per bin, so its shape matches bins: (160, 160) here.
You can capture the output (it will still plot), and use argwhere to find coordinates where values exceed, say, the 90th percentile:
h, xedges, yedges, img = hist = plt.hist2d(X, Y, bins=(160,160), norm=mpl.colors.LogNorm(vmin=1, vmax=20))
print(list(np.argwhere(h > np.percentile(h, 90))))
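To go from those bin indices back to the data, one option is sketched below. It assumes synthetic X and Y (the asker's data are not available) and uses np.histogram2d, which returns the same h, xedges and yedges as hist2d but without the image. Each point is assigned to its bin with np.searchsorted, and only the points that fall in dense bins are kept.

```python
import numpy as np

# Synthetic stand-in for the asker's X and Y.
rng = np.random.default_rng(1)
X = rng.normal(size=5000)
Y = rng.normal(size=5000)

h, xedges, yedges = np.histogram2d(X, Y, bins=(160, 160))
print(h.shape)  # one count per bin; h[i, j] is x-bin i, y-bin j

# Bins above the 90th percentile of all bin counts.
dense = h > np.percentile(h, 90)

# Each row of argwhere is an (i, j) bin index; the bin spans
# xedges[i]..xedges[i + 1] and yedges[j]..yedges[j + 1].
i, j = np.argwhere(dense)[0]
print("first dense bin:", (xedges[i], xedges[i + 1]), (yedges[j], yedges[j + 1]))

# Bin index of every point. Bins are half-open [left, right), except the
# last one, which also includes the right-most edge (hence the clip).
ix = np.clip(np.searchsorted(xedges, X, side="right") - 1, 0, h.shape[0] - 1)
iy = np.clip(np.searchsorted(yedges, Y, side="right") - 1, 0, h.shape[1] - 1)

# Keep only the points that sit in a dense bin.
in_dense_bin = dense[ix, iy]
X_dense, Y_dense = X[in_dense_bin], Y[in_dense_bin]
print(f"{in_dense_bin.sum()} of {X.size} points lie in dense bins")
```

Because most bins are empty at this resolution, the 90th percentile can be 0 or 1; a threshold on the count itself (for example h >= 5) may be easier to reason about.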
You need the seaborn package.
You mentioned
I would like to be able to get the results of this histogram and filter out the bins that have low values, preserving the high bins.
You should definitely be using one of these:
seaborn.jointplot(..., kind='hex'): shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets.
seaborn.jointplot(..., kind='kde'): uses kernel density estimation to visualize a bivariate distribution. This is the one I recommend.
Example 'kde'
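Use the number of levels (levels) and the thresh parameter to ignore low values; the default thresh=0.05 leaves the lowest-density region unfilled. (This is the spelling for seaborn 0.11 and later; older releases used n_levels, shade, and shade_lowest=False.)
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
x, y = np.random.randn(2, 300)
plt.figure(figsize=(6,5))
sns.kdeplot(x=x, y=y, zorder=0, levels=6, fill=True, cbar=True,
thresh=0.05, cmap='viridis')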

Seaborn distplot() won't display frequency in the y-axis

I am trying to display the weighted frequency in the y-axis of a seaborn.distplot() graph, but it keeps displaying the density (which is the default in distplot())
I read the documentation and also many similar questions here on Stack Overflow.
The common answer is to set norm_hist=False and to pass the weights as a NumPy array, as in a standard histogram. However, it keeps showing the density and not the probability/frequency of each bin.
My code is
plt.figure(figsize=(10, 4))
plt.xlim(-0.145,0.145)
plt.axvline(0, color='grey')
data = df['col1']
x = np.random.normal(data.mean(), scale=data.std(), size=(100000))
normal_dist =sns.distplot(x, hist=False,color="red",label="Gaussian")
data_viz = sns.distplot(data,color="blue", bins=31,label="data", norm_hist=False)
# I also tried adding the weights inside the argument
#hist_kws={'weights': np.ones(len(data))/len(data)})
plt.legend(bbox_to_anchor=(1, 1), loc=1)
And I keep receiving this output:
Does anyone have an idea of what could be the problem here?
Thanks!
[EDIT]: The problem is that the y-axis is showing the KDE values and not those from the weighted histogram. If I set kde=False then I can display the frequency on the y-axis. However, I still want to keep the KDE, so I am not considering that option.
Keeping the KDE and the frequency/count on one y-axis in one plot will not work because they have different scales. It might be better to create a plot with two y-axes, one for the KDE and one for the histogram.
From the documentation for norm_hist: "If True, the histogram height shows a density rather than a count. **This is implied if a KDE or fitted density is plotted**."
versusnja in https://github.com/mwaskom/seaborn/issues/479 has a workaround:
# Plot hist without kde.
# Create another Y axis.
# Plot kde without hist on the second Y axis.
# Remove Y ticks from the second axis.
first_ax = sns.distplot(data, kde=False)
second_ax = first_ax.twinx()
sns.distplot(data, ax=second_ax, kde=True, hist=False)
second_ax.set_yticks([])
If you need this just for visualization it should be good enough.
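The scale difference can also be checked numerically. The sketch below is a different technique from the twin-axis workaround: it rescales the density to counts so both could share one axis. It uses synthetic data as a stand-in for df['col1'] (an assumption) and a hand-written Gaussian KDE with Scott's rule bandwidth rather than seaborn's own estimate.

```python
import numpy as np

# Synthetic stand-in for df['col1'].
rng = np.random.default_rng(2)
data = rng.normal(0.0, 0.03, size=2000)

counts, edges = np.histogram(data, bins=31)
density, _ = np.histogram(data, bins=31, density=True)
width = np.diff(edges)  # bin widths (all equal here)

# A density histogram is the count histogram divided by N * bin width,
# which is why the two cannot share an axis without rescaling.
print(np.allclose(density, counts / (data.size * width)))

# Gaussian KDE evaluated at the bin centers (Scott's rule bandwidth).
centers = (edges[:-1] + edges[1:]) / 2
bw = data.std(ddof=1) * data.size ** (-1 / 5)
z = (centers[:, None] - data[None, :]) / bw
kde = np.exp(-0.5 * z**2).sum(axis=1) / (data.size * bw * np.sqrt(2 * np.pi))

# Multiplying by N * bin width puts the KDE on the count scale.
kde_as_counts = kde * data.size * width
print("peak count:", counts.max(), " peak KDE as counts:", kde_as_counts.max())
```

Plotting kde_as_counts against centers on the same axes as a count histogram would then show both on the frequency scale; dividing both by N gives per-bin probabilities instead.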

Smooth histogram in python

In my program, I calculate N values of each of three parameters and want to create a histogram for each parameter. I have strict conditions for the histograms: first, a condition on the range (the histogram must go strictly to zero at certain points), and second, it should be smooth.
I use np.histogram, as following:
hist, bins = np.histogram(Gamm1, bins=100)
center = (bins[:-1] + bins[1:]) / 2  # midpoints of the bin edges
plt.plot(center, hist)
plt.show()
but the result is too jagged. After that, I tried the following with seaborn:
snsplot = sns.kdeplot(data['Third'], shade=True)
fig = snsplot.get_figure()
fig.savefig("output2.png")
but here the approximation goes outside the allowed range (the range comes from physical conditions).
I think that changing the bins for the seaborn solution, as can be done for np.histogram, might help.
In the end, though, I am looking for a solution that is smooth and stays within the range I specify.
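One possible direction, sketched here under stated assumptions rather than offered as a definitive answer: bin the data on a fixed range with np.histogram and smooth the counts with a small Gaussian kernel. The sample data, the limits lo and hi, and the kernel width sigma_bins are all made up for illustration.

```python
import numpy as np

# Hypothetical parameter values confined to [0, 1] by physical limits.
rng = np.random.default_rng(3)
samples = rng.beta(2.0, 5.0, size=5000)
lo, hi = 0.0, 1.0

# Fixed range: the bins cover exactly [lo, hi] and nothing outside it.
counts, edges = np.histogram(samples, bins=100, range=(lo, hi))
centers = (edges[:-1] + edges[1:]) / 2

# Normalized Gaussian kernel; sigma_bins is the smoothing width in bins.
sigma_bins = 2.0
half = int(4 * sigma_bins)
offsets = np.arange(-half, half + 1)
kernel = np.exp(-0.5 * (offsets / sigma_bins) ** 2)
kernel /= kernel.sum()

# mode="same" keeps the result on the original bin centers, so the
# smoothed curve is only defined inside [lo, hi].
smooth = np.convolve(counts, kernel, mode="same")

print("curve defined from", centers[0], "to", centers[-1])
print("roughness before:", np.abs(np.diff(counts)).sum())
print("roughness after: ", np.abs(np.diff(smooth)).sum())
```

Plotting smooth against centers gives a curve that cannot leave the range. It does not by itself force the curve to reach exactly zero at the limits, and np.convolve pads with zeros, which pulls the curve down slightly near the edges.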

Matplotlib - Boxplot calculated on log10 values but shown in logarithmic scale

I think this is a simple question, but I just still can't seem to think of a simple solution. I have a set of data of molecular abundances, with values ranging many orders of magnitude. I want to represent these abundances with boxplots (box-and-whiskers plots), and I want the boxes to be calculated on log scale because of the wide range of values.
I know I can just calculate the log10 of the data and send it to matplotlib's boxplot, but this does not retain the logarithmic scale in plots later.
So my question is basically this:
When I have calculated a boxplot based on the log10 of my values, how do I convert the plot afterward to be shown on a logarithmic scale instead of linear with the log10 values?
I can change tick labels to partly fix this, but I have no clue how I get logarithmic scales back to the plot.
Or is there another more direct way to plotting this. A different package maybe that has this options already included?
Many thanks for the help.
I'd advise against doing the boxplot on the raw values and setting the y-axis to logarithmic, because the boxplot function is not designed to work across orders of magnitude and you may get too many outliers (depending on your data, of course).
Instead, you can plot the logarithm of the data and manually adjust the y-labels.
Here is a very crude example:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
values = 10 ** np.random.uniform(-3, 3, size=100)
fig = plt.figure(figsize=(9, 3))
ax = plt.subplot(1, 3, 1)
ax.boxplot(np.log10(values))
ax.set_yticks(np.arange(-3, 4))
ax.set_yticklabels(10.0**np.arange(-3, 4))
ax.set_title('log')
ax = plt.subplot(1, 3, 2)
ax.boxplot(values)
ax.set_yscale('log')
ax.set_title('raw')
ax = plt.subplot(1, 3, 3)
ax.boxplot(values, whis=[5, 95])
ax.set_yscale('log')
ax.set_title('5%')
plt.show()
The middle panel shows the box plot on the raw values. This leads to many outliers, because the maximum whisker length is computed as a multiple (default: 1.5) of the interquartile range (the box height), which does not scale across orders of magnitude.
Alternatively, you can draw the whiskers at a given percentile range, as in the right panel:
ax.boxplot(values, whis=[5, 95])
In this case you get a fixed share of outliers (5%) above and below the whiskers.
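The effect of the 1.5 × IQR rule can be checked without plotting. This minimal sketch reuses the same synthetic values as the example above and counts the points beyond the whisker limits on the raw and on the log10 scale. The helper count_outliers is hypothetical and only mirrors the rule described here; it is not matplotlib's own code.

```python
import numpy as np

np.random.seed(42)
values = 10 ** np.random.uniform(-3, 3, size=100)  # same data as above

def count_outliers(data, whis=1.5):
    """Number of points farther than whis * IQR outside the box."""
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return int(np.sum((data < low) | (data > high)))

raw_outliers = count_outliers(values)            # rule applied to raw values
log_outliers = count_outliers(np.log10(values))  # rule applied to log10 values

print("outliers on raw values:  ", raw_outliers)
print("outliers on log10 values:", log_outliers)
```

On the raw scale the box height is dominated by the upper quartile, so everything well above it is flagged; on the log scale the same data fit inside the whiskers.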
You can use plt.yscale:
plt.boxplot(data); plt.yscale('log')
