Making a binned boxplot in matplotlib with numpy and scipy in Python

I have a 2-d array containing pairs of values and I'd like to make a boxplot of the y-values by different bins of the x-values. I.e. if the array is:
my_array = array([[1, 40.5], [4.5, 60], ...])
then I'd like to bin my_array[:, 0] and then, for each of the bins, produce a boxplot of the corresponding my_array[:, 1] values that fall into that bin. So in the end I want the plot to contain as many boxplots as there are bins.
I tried the following:
from numpy import linspace, digitize

min_x = min(my_array[:, 0])
max_x = max(my_array[:, 1])
num_bins = 3
bins = linspace(min_x, max_x, num_bins)
elts_to_bins = digitize(my_array[:, 0], bins)
However, this gives me values in elts_to_bins that range from 1 to 3. I thought I should get 0-based indices for the bins, and I only wanted 3 bins. I'm assuming this is due to some trickiness in how bins are represented by linspace vs. digitize.
What is the easiest way to achieve this? I want num_bins-many equally spaced bins, with the first bin covering the lowest x-values and the last bin covering the highest... i.e., I want every data point to fall into some bin, so that I can make a boxplot.
Thanks.

You're getting the 3rd bin for the maximum value in the array (I'm assuming you have a typo there, and max_x should be "max(my_array[:,0])" instead of "max(my_array[:,1])"). You can avoid this by adding 1 (or any positive number) to the last bin.
Also, if I'm understanding you correctly, you want to bin one variable by another, so my example below shows that. If you're using recarrays (which are much slower) there are also several functions in matplotlib.mlab (e.g. mlab.rec_groupby, etc) that do this sort of thing.
Anyway, in the end, you might have something like this (to bin x by the values in y, assuming x and y are the same length)
import numpy as np

def bin_by(x, y, nbins=30):
    """
    Bin x by y.
    Returns the binned "x" values and the left edges of the bins.
    """
    bins = np.linspace(y.min(), y.max(), nbins + 1)
    # To avoid an extra bin for the max value
    bins[-1] += 1
    indices = np.digitize(y, bins)
    output = []
    for i in range(1, len(bins)):
        output.append(x[indices == i])
    # Just return the left edges of the bins
    bins = bins[:-1]
    return output, bins
As a quick example:
In [3]: x = np.random.random((100, 2))
In [4]: binned_values, bins = bin_by(x[:,0], x[:,1], 2)
In [5]: binned_values
Out[5]:
[array([ 0.59649575, 0.07082605, 0.7191498 , 0.4026375 , 0.06611863,
0.01473529, 0.45487203, 0.39942696, 0.02342408, 0.04669615,
0.58294003, 0.59510434, 0.76255006, 0.76685052, 0.26108928,
0.7640156 , 0.01771553, 0.38212975, 0.74417014, 0.38217517,
0.73909022, 0.21068663, 0.9103707 , 0.83556636, 0.34277006,
0.38007865, 0.18697416, 0.64370535, 0.68292336, 0.26142583,
0.50457354, 0.63071319, 0.87525221, 0.86509534, 0.96382375,
0.57556343, 0.55860405, 0.36392931, 0.93638048, 0.66889756,
0.46140831, 0.01675165, 0.15401495, 0.10813141, 0.03876953,
0.65967335, 0.86803192, 0.94835281, 0.44950182]),
array([ 0.9249993 , 0.02682873, 0.89439141, 0.26415792, 0.42771144,
0.12292614, 0.44790357, 0.64692616, 0.14871052, 0.55611472,
0.72340179, 0.55335053, 0.07967047, 0.95725514, 0.49737279,
0.99213794, 0.7604765 , 0.56719713, 0.77828727, 0.77046566,
0.15060196, 0.39199123, 0.78904624, 0.59974575, 0.6965413 ,
0.52664095, 0.28629324, 0.21838664, 0.47305751, 0.3544522 ,
0.57704906, 0.1023201 , 0.76861237, 0.88862359, 0.29310836,
0.22079126, 0.84966201, 0.9376939 , 0.95449215, 0.10856864,
0.86655289, 0.57835533, 0.32831162, 0.1673871 , 0.55742108,
0.02436965, 0.45261232, 0.31552715, 0.56666458, 0.24757898,
0.8674747 ])]
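With the binned values in hand, the boxplots the question asks for are one call away, since plt.boxplot accepts a list of arrays. A minimal sketch (assuming matplotlib.pyplot is imported as plt; the widths value is just a guess to match the bin spacing):
import matplotlib.pyplot as plt
binned_values, bin_edges = bin_by(x[:, 0], x[:, 1], 3)
# One box per bin, positioned at that bin's left edge
plt.boxplot(binned_values, positions=bin_edges, widths=0.2)
plt.show()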
Hope that helps a bit!

Numpy has a dedicated function for creating histograms the way you need:
histogram(a, bins=10, range=None, density=None, weights=None)
which you can use like:
(hist_data, bin_edges) = histogram(my_array[:,0], weights=my_array[:,1])
The key point here is to use the weights argument: each value a[i] will contribute weights[i] to the histogram. Example:
a = [0, 1]
weights = [10, 2]
describes 10 points at x = 0 and 2 points at x = 1.
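A quick sanity check of that behavior (a sketch, assuming numpy is imported as np):
hist_data, bin_edges = np.histogram([0, 1], bins=2, weights=[10, 2])
print(hist_data)   # [10.  2.]
print(bin_edges)   # [0.  0.5 1. ]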
You can set the number of bins, or the bin limits, with the bins argument (see the official documentation for more details).
The histogram can then be plotted with something like:
bar(bin_edges[:-1], hist_data)
If you only need to do a histogram plot, the similar hist() function can directly plot the histogram:
hist(my_array[:,0], weights=my_array[:,1])

Related

normalization of values in python np array gone wrong?

I have a matrix of floats shaped (3000, 9).
Each row is one "simulation"; the columns of a row hold the contents of that simulation.
I want, for each simulation, the first 8 columns to be normalized to the sum of the first 8 columns.
That is, each entry in the first 8 columns (for one fixed row) should become what it was before, divided by the sum of the first 8 columns of that same row.
A trivial task, but instead of a nice, correct graph (non-normalized), I get something totally unphysical when plotting with plt.scatter.
The last column of each row is what we use for the x-axis when plotting the first 8 columns (the y values).
So one row represents 8 data points for 1 fixed value of x.
The non-normalized graph:
https://ibb.co/Msr8RVB
The normalized graph:
https://ibb.co/tJp7bZn
The datasets:
non-normalized: https://easyupload.io/oat9kq
My code:
import numpy as np
from matplotlib import pyplot as plt

non_norm = np.loadtxt("integration_results_3000samples_10_20_10_25_Wcm2_BenSimulationFromSlack.txt")
plt.figure()
for i in range(non_norm.shape[1] - 1):
    plt.scatter(non_norm[:, -1], non_norm[:, i], label="c_{}".format(i + 47))
plt.xscale("log")
plt.savefig("non-norm_Ben3000samples.pdf", bbox_inches='tight')

norm = np.empty((non_norm.shape[0], non_norm.shape[1]))
norm[:, -1] = non_norm[:, -1]
for i in range(norm.shape[1] - 1):
    for j in range(norm.shape[0]):
        norm[j, i] = np.true_divide(non_norm[j, i], np.sum(non_norm[j, :-1]))
plt.figure()
for i in range(norm.shape[1] - 1):
    plt.scatter(norm[:, -1], norm[:, i], label="c_{}".format(i + 47))
plt.xscale("log")
plt.savefig("norm_Ben3000samples.pdf", bbox_inches='tight')
Do you see what went wrong?
Thank you
When you normalise a row that has just one non-zero value and seven zeroes, that value becomes 1 and the rest of the row stays 0. This is likely why your plot is messing up.
For example, the plot for the first column looks like this before and after normalization:
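As an aside, the normalization itself doesn't need the double loop; a vectorized sketch, assuming the same non_norm array as above:
norm = non_norm.copy()
# Divide the first 8 columns of each row by that row's sum over those 8 columns
norm[:, :-1] = non_norm[:, :-1] / non_norm[:, :-1].sum(axis=1, keepdims=True)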

How to find what points lie in each bin of a histogram?

I have a 2-D histogram with bin size 10. I wish to know whether there is a numpy function (or any faster method) to obtain which points lie in each bin of the 2-D grid. Is there a way to access the bin elements?
I hope this solves your problem. However, I believe others can improve my code, because I am new to Python.
Create a histogram with matplotlib:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(10)  # deterministic random data
a = np.hstack((rng.normal(size=100), rng.normal(loc=5, scale=2, size=1000)))
n, bins, patches = plt.hist(a, bins=10)  # arguments are passed to np.histogram
plt.title("Histogram with '10' bins")
plt.show()
Reshape the arrays and compare:
newbin = np.repeat(np.reshape(bins, (-1, len(bins))), len(a), axis=0)
newa = np.repeat(np.reshape(a, (len(a), -1)), len(bins), axis=1)
# For each point, the index of the first bin edge that exceeds it
index_bin = (newbin > newa).argmax(axis=1).T
Test:
print(a[0], bins)
print(index_bin[0])
Output
1.331586504129518 [-2.13171211 -0.88255884 0.36659444 1.61574771 2.86490098 4.11405425
5.36320753 6.6123608 7.86151407 9.11066734 10.35982062]
3
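For what it's worth, np.digitize gives essentially the same per-point bin index directly (boundary edge cases aside), which makes pulling out the members of one bin a one-liner; a sketch using the same a and bins:
index_bin = np.digitize(a, bins)
# Points falling in the 3rd bin, i.e. between bins[2] and bins[3]
members = a[index_bin == 3]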

how do I count the number of points in each bin?

I have a pandas df with x, y coordinates and wanted to know how I can count the number of points in each bin. I know you can visualise this using plt.hist2d(), but I wanted to make some sort of array/matrix that holds the counts per bin.
I've binned my x, y coordinates using:
bins = (df // .1 * .1).round(1).stack().groupby(level=0).apply(tuple)
where df is:
x y
-2.319059 -4.057801
1.514416 -2.325972
-2.642251 -1.004367
-1.486476 -2.535654
-0.844162 -3.078726
-2.376592 -1.471239
-3.139233 0.449457
:
etc
and bins is:
0 (-2.4, -4.1)
1 (1.5, -2.4)
3 (-2.7, -1.1)
4 (-1.5, -2.6)
6 (-0.9, -3.1)
7 (-2.4, -1.5)
8 (-3.2, 0.4)
:
etc
I tried to make an empty numpy array using:
x_size = int(max(list(df['x'])))
y_size = int(max(list(df['y'])))
my_array = np.zeros((x_size+1,y_size+1), np.int16)
but I'm not sure how I relate the bin coordinates to the array coordinates in order to count them.
Simply group by your bins and use the GroupBy.count method:
bins.groupby(bins).count()
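If you want the counts as an actual 2-D matrix rather than a Series, np.histogram2d gives you that directly; a minimal sketch with 0.1-wide bins, assuming df has columns x and y as shown:
import numpy as np
# counts[i, j] = number of points with x in x-bin i and y in y-bin j
x_edges = np.arange(df['x'].min(), df['x'].max() + 0.1, 0.1)
y_edges = np.arange(df['y'].min(), df['y'].max() + 0.1, 0.1)
counts, x_edges, y_edges = np.histogram2d(df['x'], df['y'], bins=[x_edges, y_edges])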

Getting CDF of variable-sized numpy arrays in Python using same bins?

I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

a = np.random.randn(100)
b = np.random.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x, cdf_a.y will be of different lengths than cdf_b.x, cdf_b.y, and I would like these to be the same length, i.e. computed with the same number of bins, so that they can be plotted on the same scale from a pandas DataFrame. This is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
since the CDFs are not of the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable, same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we use it in some goodness-of-fit tests is to stack the arrays, so the CDFs are defined at all points, i.e. at the points from both arrays.
Then use np.searchsorted to get the ranking: the number of points in dataset 1 below x and the number of points in dataset 2 below x.
If I remember correctly, look at scipy.stats.ks_2samp:
data1 = np.sort(data1)
data2 = np.sort(data2)
n1, n2 = len(data1), len(data2)
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
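Tying this back to the pandas goal in the question, a sketch (the bin range is widened to roughly cover standard-normal data, which is an assumption on my part):
import pandas
bins = np.linspace(-3, 3, 10)
df = pandas.DataFrame({"cdf_a": cdf_a(bins), "cdf_b": cdf_b(bins)}, index=bins)
df.plot()  # both CDFs, same length, on the same scale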

plotting high precision data

I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this:
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)

normedValues = costListTrain / np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very, very close together!
I am trying to plot this data in a way where the two quantities are on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(eigenvalues, alphaa, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all values by the max, but it didn't work!
The only way I could make it work (which is incorrect) is to normalize my data in two different ways: one based on each column (meaning factor 1 is constant and factor 2 changes), the other based on each row (meaning factor 2 is constant and factor 1 changes). But it doesn't really make sense, because I need a single plot to show the tradeoff between the two quantities in the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to the eigenvalues:
maxsEigBasedTrain = np.amax(costListTrain.T, 1)[:, np.newaxis]
maxsEigBasedTest = np.amax(costListTest.T, 1)[:, np.newaxis]
normEigCostTrain = costListTrain.T / maxsEigBasedTrain
normEigCostTest = costListTest.T / maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to the alphas:
maxsAlphaBasedTrain = np.amax(costListTrain, 1)[:, np.newaxis]
maxsAlphaBasedTest = np.amax(costListTest, 1)[:, np.newaxis]
normAlphaCostTrain = costListTrain / maxsAlphaBasedTrain
normAlphaCostTest = costListTest / maxsAlphaBasedTest
Plot 1: [image not shown]
Plot 2, where no. eigenvalue = 10 and alpha changes (should correspond to column 10 of plot 1): [image not shown]
Plot 3, where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1): [image not shown]
But as you can see, the results are different from plot 1!
UPDATE:
Just to clarify further, this is how I read my data:
import numpy as np
from sklearn import datasets

rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes = np.c_[np.ones(len(X_diabetes)), X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross = math.ceil(0.7 * len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val = ind[cross:]
X_val, y_val = X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contains the original values, before normalization, for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code: I was expecting the "No. of eigenvalue" to correspond to rows, but in my for loop they fill the columns. The correct version is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking friends and colleagues, I got this answer:
"I would assume that imshow and the other plotting commands you might want to use default to equally sized intervals on the values you are plotting. If you can set that to logarithmic, you should be fine. Ideally, equally populated bins would prove most effective, I guess."
For plotting, I just subtract the min value from the error, then add a small number, and at the end take the log:
temp = costListTrain - costListTrain.min()
temp += 0.00000001
extent = [0, 20, alpha_list[0], alpha_list[-1]]
# 'spectral' was renamed 'nipy_spectral' in newer matplotlib versions
plt.imshow(np.log(temp), interpolation="nearest", cmap=plt.get_cmap('nipy_spectral'), extent=extent, origin="lower")
plt.colorbar()
and the result is: [image not shown]
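For what it's worth, matplotlib can also do the log scaling of the colors itself via matplotlib.colors.LogNorm, which keeps the colorbar labeled in the original units; a sketch under the same assumptions (the shifted values are strictly positive):
from matplotlib.colors import LogNorm
shifted = costListTrain - costListTrain.min() + 0.00000001
plt.imshow(shifted, interpolation="nearest", norm=LogNorm(), extent=extent, origin="lower")
plt.colorbar()  # tick labels stay in the (shifted) data units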
