The starting point is a pandas DataFrame with several columns.
The second step was to convert some columns of this DataFrame to a NumPy array using the to_numpy() function.
I retrieve something like:
[[100 200 3.5 1] [100 200 3.5 1] [100 300 6.2 1] [200 125 4.2 1] [100 300 6.2 1] [100 200 3.5 1]]
where the first element is an origin id,
the second element is a destiny id,
the 3rd is the distance between origin and destiny,
and the 4th is just a counter (always 1). I included it only because I thought it might be needed to count elements; ignore it if your proposed solution doesn't use it.
I would like to have a scatterplot with the following specifications:
origin_id on the x axis
destiny_id on the y axis
color of the scatter dot on a warm scale that indicates the distance between both points (3rd element)
size of the scatter dot depends on the number of origin_id/destiny_id pairs we have. For example, there are three 100 200 combinations, so that dot should be bigger than the one for the combination 200 125, which has only one entry.
I have tried, but I'm not able to meet all of these requirements in one plot.
How could this be achieved in matplotlib? Or is there an easier approach using pandas directly?
If I understood your requirements correctly, this should do the trick:
import matplotlib.pyplot as plt
import numpy as np
data = np.array([[100,200,3.5,1],[100,200,3.5,1],[100,300,6.2,1],[200,125,4.2,1],[100,300,6.2,1],[100,200,3.5,1]])
unique, counts = np.unique(data, axis=0, return_counts=True)
x = unique[:,0]
y = unique[:,1]
c = unique[:,2]
## figure out a nice looking scaling factor here
# and remember that the scatter point size is supposed to be an area,
# hence squaring a base factor is ideal
s = (counts*10)**2
fig, ax = plt.subplots()
sca = ax.scatter(x,y,c=c,s=s)
plt.colorbar(sca)
plt.show()
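which yields a scatter plot where the marker size reflects how often each pair occurs and the color reflects the distance.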
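Since the question also asks about a pandas route, here is a minimal sketch of the same idea starting from a DataFrame. The column names (origin_id, destiny_id, distance) are assumptions made for this example, and "autumn_r" is just one of several warm matplotlib colormaps that could be used.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample rows from the question; the column names are made up for this sketch.
df = pd.DataFrame(
    [[100, 200, 3.5], [100, 200, 3.5], [100, 300, 6.2],
     [200, 125, 4.2], [100, 300, 6.2], [100, 200, 3.5]],
    columns=["origin_id", "destiny_id", "distance"],
)

# One row per (origin, destiny) pair: how often it occurs and its mean distance.
pairs = (
    df.groupby(["origin_id", "destiny_id"])["distance"]
      .agg(count="size", distance="mean")
      .reset_index()
)

fig, ax = plt.subplots()
sca = ax.scatter(
    pairs["origin_id"], pairs["destiny_id"],
    c=pairs["distance"],
    s=(pairs["count"] * 10) ** 2,   # marker area grows with the pair count
    cmap="autumn_r",                # a warm colormap; "hot" or "YlOrRd" also exist
)
fig.colorbar(sca, label="distance")
ax.set_xlabel("origin_id")
ax.set_ylabel("destiny_id")
# plt.show()  # uncomment in an interactive session
```

Grouping in pandas removes the need for the counter column, because the group size is the count.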
I have two arrays:
Obs=([])
abs_error=([])
I want to use Obs to define the bins. For example, where Obs is between 1 and 2, abs_error goes into bin #1; where Obs is between 2 and 3, abs_error goes into bin #2; and so on.
Once I have my binned abs_error (which was binned by Obs) I want to calculate the mean of each bin and then plot the mean of each bin on the y-axis vs the bins on the x-axis.
How do I go about binning the abs_error by bins defined by the Obs? And how do I calculate the mean of each bin once this is done?
Right now I have:
abs_error=np.array([2.214033842086792 2.65031099319458 2.021354913711548 ... 2.831442356109619 1.9227538108825684 0.19358205795288086])
obs=np.array([3.3399999141693115 1.440000057220459 1.2799999713897705 ... 5.78000020980835 6.050000190734863 7.75])
bin_boundaries=np.array([0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0])
idx = np.digitize(obs, bin_boundaries)
mn = np.bincount(idx, abs_error) / np.bincount(idx)
print(mn)
[83.09254473 3.18577858 2.82887524 2.78532805 2.43264693 1.96835116 1.77645996 1.66138196 1.5972414 1.57512014 1.53094066 1.7965252 1.98050336 2.29916244 3.06640482 4.66769505 3.16787195]
I can't print the whole arrays because they are very big.
If your bins are all the same size, as in your example, you can use floor division to obtain the bin indices from Obs:
idx = (Obs // 1).astype(int)
If not, use np.digitize instead:
idx = np.digitize(Obs, bin_boundaries)
Once you have the indices, use them with np.bincount to obtain the means:
mn = np.bincount(idx, abs_error) / np.bincount(idx)
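As a small worked example of the digitize plus bincount recipe, the sketch below uses made-up data in place of the question's large arrays. It assumes every value of obs lies inside the range covered by bin_boundaries, and it leaves empty bins as NaN rather than dividing by zero.

```python
import numpy as np

# Made-up sample data standing in for the question's large arrays.
obs = np.array([0.5, 1.2, 1.8, 2.5, 2.6, 4.1])
abs_error = np.array([1.0, 2.0, 4.0, 3.0, 5.0, 6.0])
bin_boundaries = np.arange(0.0, 6.0, 1.0)   # edges 0, 1, ..., 5

# digitize returns i with bin_boundaries[i-1] <= obs < bin_boundaries[i],
# so subtracting 1 gives 0-based bin numbers.
idx = np.digitize(obs, bin_boundaries) - 1
n_bins = len(bin_boundaries) - 1

sums = np.bincount(idx, weights=abs_error, minlength=n_bins)
counts = np.bincount(idx, minlength=n_bins)

# Divide only where a bin has data; empty bins stay NaN.
means = np.full(n_bins, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)

# Bin centers, usable as x values when plotting the means.
centers = 0.5 * (bin_boundaries[:-1] + bin_boundaries[1:])
```

The per-bin means can then be plotted with something like plt.plot(centers, means, "o-").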
I have 16 Gaussian curves that I have to combine into one curve. I have not been able to implement the sum of Gaussians (multiple regression) in Python.
Here is the code I am using:
import matplotlib.pyplot as plt
import numpy as np
a=np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1=np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
x=[]
s=[]
v5=9.9e2
for j in range(0,len(a)):
    for i in range(-1500,1500):
        v11=a[j]+i
        x.append(v11)
        z=np.exp((-4*np.log(2)*((v11-a[j])/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
        s.append(z*v1[j])
plt.plot(x,s,'--r',)
plt.stem(a,v1)
Which generates the following plot (with the problem circled):
Instead of the desired output:
The output of your code shows this overlap because you are not summing the 16 gaussians. Instead you are creating a list containing [x1_g1,x2_g1,...,x3000_g1,x1_g2,...,x3000_g16], and the same for s: a 1d sequence holding the 3000 x values of the first gaussian, then the 3000 x values of the second gaussian, and so on. The values are never added, so the plot shows the 16 independent gaussians instead of their sum, which is the desired output.
In the current code, the x values of each gaussian are different (running from -1500 to +1500 around its center), which makes adding the 16 gaussians more complicated.
If we consider only the first 2 gaussians for instance, centered at 3750 and -250, the values appended in x from the first gaussian go from 2250 to 5250 in steps of 1, as well as their images in s which are s(2250)... Afterwards, the values of the second gaussian (x between -1750 and 1250) are appended (not added), which will result in an x list like that:
x = [2250,2251,<in steps of 1>,5249,5250,-1750,-1749,<in steps of 1>,1250]
And s is a list where each position contains the image of the same position in x. Starting from this format, getting the final output, which is the sum of the gaussians, is difficult, because we would have to check for equal values of x and sum their contributions.
However, if we instead always evaluate the gaussians at the same positions (in the example, between -1750 and 5250 in steps of 1), we store many more values, most of them close to zero, but adding them becomes straightforward.
Half-way vectorization
One option similar to the code in the question is the following:
a = np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1 = np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
v5 = 9.9e2
xrange = np.arange(a.min()-1500,a.max()+1500)
# This generates an array between the minimum of a minus 1500 and the maximum of a
# plus 1500. This way, all the values in the old x list are contained in this array.
# Therefore, it becomes really easy to sum the contribution of each gaussian,
# because only an element-wise sum is needed.
s = np.zeros(len(xrange))
for j,aj in enumerate(a):
    z = np.exp((-4*np.log(2)*((xrange-aj)/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
    s += z*v1[j]
plt.plot(xrange,s,'--r')
plt.stem(a,v1)
The output plot is the same as for the completely vectorized solution.
Completely vectorized solution
One simple solution is to define a unique xrange for all 16 gaussians, then calculate s for each of them (on the same x values) and finally sum over the 16 gaussians:
a = np.array([3750.0, -250.0, 6750.0, 2750.0, -2050.0, 6350.0, 1550.0, -4050.0, 5750.0, 150.0, -6250.0, 4950.0, -1450.0, -8650.0, 3950.0, -3250.0])
v1 = np.array( [2.5470357695283954, 0.1937004980283323, 0.43831655553839766, 6.07645636407398, 0.6331239135554633, 0.969937308645575, 13.38133838752005, 1.3226417845166933, 1.5531178254607325, 27.599625693090765, 2.031000233294804, 1.635762971986014, 53.83073800155456, 2.0719664311822843, 0.0, 100.0])
v5 = 9.9e2
xrange = np.arange(a.min()-1500,a.max()+1500)
z = np.exp((-4*np.log(2)*((xrange-a.reshape((len(a),1)))/(v5))**2))*((4.5*np.log(2)/(np.pi))**0.5)
s = z*v1.reshape((len(a),1))
plt.plot(xrange,s.sum(axis=0),'--r')
plt.stem(a,v1)
Note that I have removed the 2 nested loops using numpy.
The loop over range(-1500,1500) can be avoided by defining i=np.arange(-1500,1500) instead of the for i in ... loop and leaving the rest of the code untouched (only the indentation has to be updated). That is because numpy operates element-wise over arrays.
The second loop is a bit trickier. The a and v1 arrays are reshaped to 2d arrays of shape (16,1) in order to generate a z with the shape (16,len(xrange)). That is why combining the array xrange, whose length is much larger than 16, with a does not raise a shape-mismatch error: the 16 values of a run along the first dimension and the values of xrange along the second.
The code above generates the following plot:
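To make the reshaping step more concrete, here is a tiny sketch of the broadcasting involved. The names centers and grid and their values are made up for illustration; only the shapes matter.

```python
import numpy as np

centers = np.array([3750.0, -250.0, 6750.0])   # three made-up centers
grid = np.arange(-1750.0, 8250.0)              # shared x values, shape (10000,)

# A (3, 1) array combined with a (10000,) array broadcasts to (3, 10000):
# row j holds grid - centers[j].
diff = grid - centers.reshape((len(centers), 1))
print(diff.shape)

# centers[:, None] is an equivalent, shorter spelling of the reshape.
same = np.array_equal(diff, grid - centers[:, None])
```

Summing such a 2d array over axis=0 collapses the per-gaussian rows into one curve, which is what s.sum(axis=0) does in the vectorized solution.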
Groupby solution
There is also the option of working with the same code to generate x and s and afterwards, plot every unique value of x (the same value of x can be found in x[i1],x[i2],x[i3]) versus s[i1]+s[i2]+s[i3].
This can be done by adding the following code after the loops:
x,s = np.array(x),np.array(s)
ind = np.argsort(x)
x,s = x[ind],s[ind]
unique_x = np.unique(x)
catsums=[]
for k in unique_x:
    catsums.append(np.sum(s[np.where(x==k)]))
plt.plot(unique_x,catsums,'--r')
plt.stem(a,v1)
This groupby can also be vectorized using numpy or pandas as it is explained in this other SO answer
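One possible way to vectorize this groupby in NumPy, sketched here on made-up data rather than the real x and s, is to let np.unique produce group labels and let np.bincount add up the values per label.

```python
import numpy as np

# Made-up stand-ins for the concatenated x and s lists built by the loops.
x = np.array([3.0, 1.0, 2.0, 1.0, 3.0, 3.0])
s = np.array([0.5, 1.0, 2.0, 4.0, 0.25, 0.25])

# inverse[i] is the position of x[i] in unique_x, so it works as a group label.
unique_x, inverse = np.unique(x, return_inverse=True)

# bincount with weights adds up the s values that share the same label.
catsums = np.bincount(inverse, weights=s)
```

This replaces the Python loop over unique_x with two array calls, and no explicit sorting is needed because np.unique returns sorted values.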
I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
a = np.random.randn(100)
b = np.random.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x, cdf_a.y have different lengths from cdf_b.x, cdf_b.y, and I would like them to be the same length, i.e. to use the same number of bins to compute each CDF so that they can be plotted on the same scale from a pandas DataFrame. This is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
since the CDFs are not of the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we do it in some goodness-of-fit tests is to stack the arrays, so that the CDFs are defined on all points from both arrays.
Then use np.searchsorted to get the ranks: the number of points in dataset 1 at or below x and the number of points in dataset 2 at or below x.
If I remember correctly, scipy.stats.ks_2samp does this; take a look at it.
n1 = len(data1)
n2 = len(data2)
data1 = np.sort(data1)
data2 = np.sort(data2)
data_all = np.concatenate([data1,data2])
cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
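One thing to watch with a fixed grid is that it should cover the range of the data; np.linspace(0, 1, 10) only samples the interval from 0 to 1, while standard normal samples spread well outside it. Below is a minimal sketch that builds the grid from the data and stores both CDFs in one DataFrame. It uses a plain NumPy ECDF (the helper ecdf_on_grid is made up for this example) instead of the statsmodels class, and the choice of 50 grid points is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
b = rng.standard_normal(500)

def ecdf_on_grid(sample, grid):
    """Fraction of sample values <= each grid point (plain NumPy ECDF)."""
    sample = np.sort(sample)
    return np.searchsorted(sample, grid, side="right") / len(sample)

# Shared grid spanning both samples; 50 points is an arbitrary choice.
grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 50)

df = pd.DataFrame(
    {"cdf_a": ecdf_on_grid(a, grid), "cdf_b": ecdf_on_grid(b, grid)},
    index=pd.Index(grid, name="x"),
)
# df.plot() would then draw both CDFs on the same x axis.
```

With the statsmodels objects from the question, calling cdf_a(grid) and cdf_b(grid) on the same grid should play the role of ecdf_on_grid here.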
I have a 2-d array containing pairs of values and I'd like to make a boxplot of the y-values by different bins of the x-values. I.e. if the array is:
my_array = array([[1, 40.5], [4.5, 60], ...]])
then I'd like to bin my_array[:, 0] and then for each of the bins, produce a boxplot of the corresponding my_array[:, 1] values that fall into each box. So in the end I want the plot to contain number of bins-many box plots.
I tried the following:
min_x = min(my_array[:, 0])
max_x = max(my_array[:, 1])
num_bins = 3
bins = linspace(min_x, max_x, num_bins)
elts_to_bins = digitize(my_array[:, 0], bins)
However, this gives me values in elts_to_bins that range from 1 to 3. I thought I should get 0-based indices for the bins, and I only wanted 3 bins. I'm assuming this is due to some trickiness in how bins are represented in linspace vs. digitize.
What is the easiest way to achieve this? I want num_bins-many equally spaced bins, with the first bin containing the lower half of the data and the upper bin containing the upper half... i.e., I want each data point to fall into some bin, so that I can make a boxplot.
thanks.
You're getting the 3rd bin for the maximum value in the array (I'm assuming you have a typo there, and max_x should be "max(my_array[:,0])" instead of "max(my_array[:,1])"). You can avoid this by adding 1 (or any positive number) to the last bin.
Also, if I'm understanding you correctly, you want to bin one variable by another, so my example below shows that. If you're using recarrays (which are much slower) there are also several functions in matplotlib.mlab (e.g. mlab.rec_groupby, etc) that do this sort of thing.
Anyway, in the end, you might have something like this (to bin x by the values in y, assuming x and y are the same length)
def bin_by(x, y, nbins=30):
    """
    Bin x by y.
    Returns the binned "x" values and the left edges of the bins
    """
    bins = np.linspace(y.min(), y.max(), nbins+1)
    # To avoid extra bin for the max value
    bins[-1] += 1
    indices = np.digitize(y, bins)
    output = []
    for i in range(1, len(bins)):
        output.append(x[indices==i])
    # Just return the left edges of the bins
    bins = bins[:-1]
    return output, bins
As a quick example:
In [3]: x = np.random.random((100, 2))
In [4]: binned_values, bins = bin_by(x[:,0], x[:,1], 2)
In [5]: binned_values
Out[5]:
[array([ 0.59649575, 0.07082605, 0.7191498 , 0.4026375 , 0.06611863,
0.01473529, 0.45487203, 0.39942696, 0.02342408, 0.04669615,
0.58294003, 0.59510434, 0.76255006, 0.76685052, 0.26108928,
0.7640156 , 0.01771553, 0.38212975, 0.74417014, 0.38217517,
0.73909022, 0.21068663, 0.9103707 , 0.83556636, 0.34277006,
0.38007865, 0.18697416, 0.64370535, 0.68292336, 0.26142583,
0.50457354, 0.63071319, 0.87525221, 0.86509534, 0.96382375,
0.57556343, 0.55860405, 0.36392931, 0.93638048, 0.66889756,
0.46140831, 0.01675165, 0.15401495, 0.10813141, 0.03876953,
0.65967335, 0.86803192, 0.94835281, 0.44950182]),
array([ 0.9249993 , 0.02682873, 0.89439141, 0.26415792, 0.42771144,
0.12292614, 0.44790357, 0.64692616, 0.14871052, 0.55611472,
0.72340179, 0.55335053, 0.07967047, 0.95725514, 0.49737279,
0.99213794, 0.7604765 , 0.56719713, 0.77828727, 0.77046566,
0.15060196, 0.39199123, 0.78904624, 0.59974575, 0.6965413 ,
0.52664095, 0.28629324, 0.21838664, 0.47305751, 0.3544522 ,
0.57704906, 0.1023201 , 0.76861237, 0.88862359, 0.29310836,
0.22079126, 0.84966201, 0.9376939 , 0.95449215, 0.10856864,
0.86655289, 0.57835533, 0.32831162, 0.1673871 , 0.55742108,
0.02436965, 0.45261232, 0.31552715, 0.56666458, 0.24757898,
0.8674747 ])]
Hope that helps a bit!
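To get from binned values to the boxplots the question asks for, something along these lines might work. It is a sketch on made-up random data; it clips the digitize result instead of widening the last edge, which is an alternative to the bins[-1] += 1 trick, and the bin labels are only one way to annotate the axis.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Made-up (x, y) pairs standing in for my_array.
my_array = np.column_stack([rng.uniform(0, 10, 200), rng.normal(50, 10, 200)])
x, y = my_array[:, 0], my_array[:, 1]

num_bins = 3
edges = np.linspace(x.min(), x.max(), num_bins + 1)
# Clip so the maximum x lands in the last bin instead of an extra one.
idx = np.clip(np.digitize(x, edges) - 1, 0, num_bins - 1)

# One array of y values per x bin.
groups = [y[idx == i] for i in range(num_bins)]
centers = 0.5 * (edges[:-1] + edges[1:])

fig, ax = plt.subplots()
ax.boxplot(groups, positions=centers, widths=0.6 * (edges[1] - edges[0]))
ax.set_xticks(centers)
ax.set_xticklabels([f"{lo:.1f} to {hi:.1f}" for lo, hi in zip(edges[:-1], edges[1:])])
ax.set_xlabel("x bin")
ax.set_ylabel("y")
# plt.show()  # uncomment in an interactive session
```

Each box is drawn at the center of its x bin, so the plot contains num_bins boxes.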
NumPy has a dedicated function for creating histograms the way you need to (signature as of current NumPy releases):
histogram(a, bins=10, range=None, density=None, weights=None)
which you can use like:
(hist_data, bin_edges) = histogram(my_array[:,0], weights=my_array[:,1])
The key point here is to use the weights argument: each value a[i] will contribute weights[i] to the histogram. Example:
a = [0, 1]
weights = [10, 2]
describes 10 points at x = 0 and 2 points at x = 1.
You can set the number of bins, or the bin limits, with the bins argument (see the official documentation for more details).
The histogram can then be plotted with something like:
bar(bin_edges[:-1], hist_data)
If you only need to do a histogram plot, the similar hist() function can directly plot the histogram:
hist(my_array[:,0], weights=my_array[:,1])
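Note that a weighted histogram gives the sum of the weights in each bin, not their distribution. The sketch below, on made-up data with hand-picked bin edges, shows this and how a per-bin mean could be derived by dividing by the unweighted counts.

```python
import numpy as np

# Made-up (x, y) pairs.
my_array = np.array([[0.5, 10.0], [0.7, 20.0], [1.5, 5.0], [2.5, 1.0], [2.9, 3.0]])
edges = [0.0, 1.0, 2.0, 3.0]

# With weights, each bin holds the SUM of the y values whose x falls in it.
sums, _ = np.histogram(my_array[:, 0], bins=edges, weights=my_array[:, 1])
# Without weights, each bin holds the number of points.
counts, _ = np.histogram(my_array[:, 0], bins=edges)

means = sums / counts   # per-bin mean of y; assumes no bin is empty
```

For a boxplot of the y values in each bin, the individual values are still needed, so this approach only covers summaries such as sums and means.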