average point on each bin pandas - python

I have 2 dataframes temperature(y) and ratio(x). In each dataframe I have 60 columns corresponding to 60 different machines that measure both parameters.
for now I have a plot for each machine of y vs x, as follow:
for column in ratio.columns:
x = ratio[column]
y = temperature[column]
if len(x) != len(y):
x_ind = x.index
y_ind = y.index
common_ind = x_ind.intersection(y_ind)
x = x[common_ind]
y = y[common_ind]
plt.scatter(x,y)
plt.savefig("plot" +column+".png")
plt.clf()
because I have a lot of data points, I want to do binning for each machine and to do an average on each bin, so that I will have an average point of y for each bin.
x is between 0 and 1 and I want to bin every 0.05, which gives 20 bins.
I got an histogram for each machine by doing:
for x in ratio.columns:
ratio.hist(column = x, bins = 20)
but this is only giving number of events vs ratio.
how can I link the temperature dataframe
I am new to pandas and I can't figure out how to do this

mask bin every 20
mask = my_df.index//20
then use groupby and agg
my_df.groupby(mask).agg(['mean'])

Related

Having some problem to understand the x_bin in regplot of Seaborn

I used the seaborn.regplot to plot data, but not quite understand how the error bar in regplot was calculated. I have compared the results with the mean and standard deviation derived from mannual calculation. Here is my testing script.
import numpy as np
import pandas as pd
import seaborn as sn
def get_data_XYE(p):
x_list = []
lower_list = []
upper_list = []
for line in p.lines:
x_list.append(line.get_xdata()[0])
lower_list.append(line.get_ydata()[0])
upper_list.append(line.get_ydata()[1])
y = 0.5 * (np.asarray(lower_list) + np.asarray(upper_list))
y_error = np.asarray(upper_list) - y
x = np.asarray(x_list)
return x, y, y_error
x = [37.3448,36.6026,42.7795,34.7072,75.4027,226.2615,192.7984,140.8045,242.9952,458.451,640.6542,726.1024,231.7347,107.5605,200.2254,190.0006,314.1349,146.8131,152.4497,175.9096,284.9926,116.9681,118.2953,312.3787,815.8389,458.0146,409.5797,595.5373,188.9955,15.7716,36.1839,244.8689,57.4579,94.8717,112.2237,87.0687,72.79,22.3457,24.1728,29.505,80.8765,252.7454,280.6002,252.9573,348.246,112.705,98.7545,317.0541,300.9573,402.8411,406.6884,56.1286,30.1385,32.9909,497.556,19.3606,20.8409,95.2324,108.6074,15.7753,54.5511,45.5623,64.564,101.1934,81.8459,88.286,58.2642,56.1225,51.2943,38.0649,63.5882,63.6847,120.495,102.4097,49.3255,111.3309,171.6028,58.9526,28.7698,144.6884,180.0661,116.6028,146.2594,199.8702,128.9378,423.2363,119.8537,124.6508,518.8625,306.3023,79.5213,121.0309,116.9346,170.8863,930.361,48.9983,55.039,47.1092,72.0548,75.4045,103.521,83.4134,142.3253,146.6215,121.4467,101.4252,68.4812,291.4275,143.9475,142.647,78.9826,47.094,204.2196,89.0208,82.792,27.1346,142.4764,83.7874,67.3216,112.9531,138.2549,133.3446,86.2659,45.3464,56.1604,43.5882,54.3623,86.296,115.7272,96.5498,111.8081,36.1756,40.2947,34.2532,89.1452,53.9062,36.458,113.9297,176.9962,77.3125,77.8891,64.807,64.1515,127.7242,119.6876,976.2324,322.8454,434.2883,168.6923,250.0284,234.7329,131.0793,152.335,118.8838,243.1772,24.1776,168.6327,170.7541,167.8444,75.9315,110.1045,113.4417,60.5464,66.8956,79.7606,71.6659,72.5251,77.513,207.8019,21.8592,35.2787,169.7698,146.5012,412.9934,248.0708,318.5489,104.1278,184.7592,108.0581,175.2646,169.7698,340.3732,570.3396,23.9853,69.0405,66.7391,67.9435,294.6085,68.0537,77.6344,433.2713,104.3178,229.4615,187.8587,78.1399,121.4737,122.5451,384.5935,38.5232,117.6835,50.3308,318.2513,103.6695,20.7181,321.9601,510.3248,13.4754,16.1188,44.8082,37.7291,733.4587,446.6241,21.1822,287.9603,327.2367,274.1109,195.4713,158.2114,64.4537,26.9857,172.8503]
y = [37,40,30,29,24,23,27,12,21,20,29,28,27,32,23,29,28,22,28,23,24,29,32,18,22,12,12,14,29,31,34,31,22,40,25,36,27,27,29,35,33,25,25,27,27,19,35,26,18,24,25,37,52,47,34,39,40,48,41,44,35,36,53,46,38,44,23,26,26,28,27,21,25,21,20,27,35,24,46,34,22,30,30,30,31,26,25,28,21,31,24,27,33,21,31,33,29,33,32,21,25,22,39,31,34,26,23,18,20,18,34,25,20,12,23,25,21,21,25,31,17,27,28,29,25,24,25,21,24,27,23,22,23,22,22,26,22,19,26,35,33,35,29,26,26,30,22,32,33,33,28,32,26,29,36,37,37,28,24,30,25,20,29,24,33,35,30,32,31,33,40,35,37,24,34,29,27,24,36,26,26,26,27,27,20,17,28,34,18,20,20,18,19,23,20,22,25,32,44,41,39,41,40,44,36,42,31,32,26,29,23,29,29,28,31,22,29,24,28,28,25]
xbreaks = [13.4754, 27.1346, 43.5882, 58.9526, 72.79, 89.1452, 110.1045, 131.0793, 158.2114, 180.0661, 207.8019, 234.7329, 252.9573, 300.9573, 327.2367, 348.246, 412.9934, 434.2883, 458.451, 518.8625, 595.5373, 640.6542, 733.4587, 815.8389, 930.361, 976.2324]
df = pd.DataFrame([x,y]).T
df.columns = ['x','y']
# Check the bin average and std using agge
bins = pd.cut(df.x,xbreaks,right=False)
t = df[['x','y']].groupby(bins).agg({"x": "mean", "y": ["mean","std"]})
t.reset_index(inplace=True)
t.columns = ['range_cut','x_avg_cut','y_avg_cut','y_std_cut']
t.index.name ='id'
# Get the bin average from
g = sns.regplot(x='x',y='y',data=df,fit_reg=False,x_bins=xbreaks,seed=seed)
xye = pd.DataFrame(get_data_XYE(g)).T
xye.columns = ['x_regplot','y_regplot','e_regplot']
xye.index.name = 'id'
t2 = xye.merge(t,on='id',how='left')
t2
You can see the y and e from the two ways are different. I understand that the default x_ci or x_estimator may afect the result of regplot, but I still can not the these values in excel by removing some lowest and/or highest values in each bin.
In seaborn.regplot, the x_bins are the center of each bin, and the original x values are assigned to the nearest bin value. Whereas in pandas.cut, the breaks define the bin edges.

Bin average as a function of position

I want to efficiently calculate the average of a variable (say temperature) over multiple areas of the plane.
I essentially want to do the following.
import numpy as np
num = 10000
XYT = np.random.uniform(0, 1, (num, 3))
X = np.transpose(XYT)[0]
Y = np.transpose(XYT)[1]
T = np.transpose(XYT)[2]
size = 10
bins = np.empty((size, size))
for i in range(size):
for j in range(size):
if rescaled X,Y in bin[i][j]:
bins[i][j] = mean T
I would use pandas (although im sure you can achieve basically the same with vanilla numpy)
df = pandas.DataFrame({'x':npX,'y':npY,'z':npZ})
# solve quadrants
df['quadrant'] = (df['x']>=0)*2 + (df['y']>=0)*1
# group by and aggregate
mean_per_quadrant = df.groupby(['quadrant'])['temp'].aggregate(['mean'])
you may need to create multiple quadrant cutoffs to get unique groupings
for example (df['x']>=50)*4 + (df['x']>=0)*2 + (df['y']>=0)*1 would add an extra 2 quadrants to our group (one y>=0, and one y<0) (just make sure you use powers of 2)

python: increase performance of finding the best timeshift for a correlation between each X column and y

I have a dataframe X with several columns and a dataframe y with only one column (series). The rows in X represent timesteps and I want to find the interval I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and then loops over all timesteps and correlates the X column with y. If the R² is better than before I store the timestep. However, with over 300 columns this routine is really taking some time and I need to increase the performance. Is there a nice way to simplify this code?
(In the example I used the iris data set which is of course not a timeseries...)
from sklearn import datasets
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from copy import deepcopy
def get_best_shift(dfX, dfy, ti=60, maxt=1440):
"""
determines the best correlation for the last maxt minutes based on a
timestep of ti minutes. Creates a dataframe with the shifted variables based on the
best match (strongest correlation).
"""
df_out = deepcopy(dfX)
for xcol in dfX:
bestshift = 0
Rmax = 0
for ishift in range(0, int(maxt / ti)):
xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
selector = np.array([str(val)!="nan" for val in (xvals*yvals)],dtype=bool)
xvals = xvals[selector]
yvals = yvals[selector]
R = np.corrcoef(xvals,yvals)[0][1]
# plt.figure()
# plt.plot(xvals,yvals,'k.')
# plt.show()
if R ** 2 > Rmax:
Rmax = R ** 2
# print(Rmax)
bestshift = ishift
df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
return df_out
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X,y)

Panda dataframe column cut - add more bins more frequently around the mean

I am categorizing quantitative variable (e.g. price) and I would like to categorize it in the manner that the bins would be much more frequent around the mean and less when away from the mean.
I have seen that there are possibilities to cut() in linear manner and thanks to numpy.logspace in logarithmic manner, but binning around the mean seems to be void and my ideas so far haven't worked and seem to be inefficient.
You can make bins that increase in size linearly:
import numpy as np
def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
x_rel_lim = max(mean_x - min_x, mean_x - max_x)
num_bins_half = num_bins // 2
bins_right = np.arange(0, num_bins_half + 1)
if num_bins % 2 == 1:
bins_right = bins_right + 0.5
bins_right = np.cumsum(bins_right)
bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
return bins
And then you can use it like:
import numpy as np
import matplotlib.pyplot as plt
bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)
I made a script that might do what you want to achieve, but I'm not sure how to convert the resulted cut object into a histogram to see if it does what i want it to do, so please check and tell me if it works :).
# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)
num_bins = 100
# I used a square function to distribute the bins more around 0 and
# less at the outskirts of the range.
shape_func = lambda x: x**2
bin_loc = [shape_func(i) for i in range(num_bins//2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]
# Rescale and translate bins
data_mean = df.price.mean()
data_range = df.price.max() - df.price.min()
final_bin_loc = [(x + data_mean) / (data_range * num_bins) for x in bin_loc]
# display(final_bin_loc)
binned = pd.cut(df.price, bin_loc)

Referencing Data From a 2D Histogram

I have the following code that reads data from a CSV file and creates a 2D histogram:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
#Read in CSV data
filename = 'Complete_Storms_All_US_Only.csv'
df = pd.read_csv(filename)
min_85 = df.min85
min_37 = df.min37
verification = df.one_min_15
#Numbers
x = min_85
y = min_37
H = verification
#Estimate the 2D histogram
nbins = 33
H, xedges, yedges = np.histogram2d(x,y,bins=nbins)
#Rotate and flip H
H = np.rot90(H)
H = np.flipud(H)
#Mask zeros
Hmasked = np.ma.masked_where(H==0,H)
#Calculate Averages
avgarr = np.zeros((nbins, nbins))
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])
for xb, yb, v in zip(xbins, ybins, verification):
avgarr[yb, xb] += v
divisor = H.copy()
divisor[divisor==0.0] = np.nan
avgarr /= divisor
binavg = np.around((avgarr * 100), decimals=1)
binper = np.ma.array(binavg, mask=np.isnan(binavg))
#Plot 2D histogram using pcolor
fig1 = plt.figure()
plt.pcolormesh(xedges,yedges,binper)
plt.title('1 minute at +/- 0.15 degrees')
plt.xlabel('min 85 GHz PCT (K)')
plt.ylabel('min 37 GHz PCT (K)')
cbar = plt.colorbar()
cbar.ax.set_ylabel('Probability of CG Lightning (%)')
plt.show()
Each pixel in the histogram contains the probability of lightning for a given range of temperatures at two different frequencies on the x and y axis (min_85 on the x axis and min_37 on the y axis). I am trying to reference the probability of lightning from the histogram based on a wide range of temperatures that vary on an individual basis for any given storm. Each storm has a min_85 and min_37 that corresponds to a probability from the 2D histogram. I know there is a brute-force method where you can create a ridiculous amount of if statements, with one for each pixel, but this is tedious and inefficient when trying to incorporate over multiple 2D histograms. Is there a more efficient way to reference the probability from the histogram based on the given min_85 and min_37? I have a separate file with the min_85 and min_37 data for a large amount of storms, I just need to assign the corresponding probability of lightning from the histogram to each one.
It sounds like all you need to do is turn the min_85 and min_37 values into indices. Something like this will work:
# min85data and min37data from your file
dx = xedges[1] - xedges[0]
dy = yedges[1] - yedges[0]
min85inds = np.floor((min85data - yedges[1]) / dx).astype(np.int)
min37inds = np.floor((min37data - yedges[0]) / dy).astype(np.int)
# Pretend you didn't do all that flipping of H, or make a copy of it first
hvals = h_orig[min85inds, min37ends]
But do make sure that the resulting indices are valid before you extract them.

Categories

Resources