average point in each bin - pandas / python
I have 2 dataframes, temperature (y) and ratio (x). Each dataframe has 60 columns corresponding to 60 different machines that measure both parameters.
For now I have a plot of y vs x for each machine, produced as follows:
import matplotlib.pyplot as plt

for column in ratio.columns:
    x = ratio[column]
    y = temperature[column]
    if len(x) != len(y):
        # keep only the timestamps present in both series
        x_ind = x.index
        y_ind = y.index
        common_ind = x_ind.intersection(y_ind)
        x = x[common_ind]
        y = y[common_ind]
    plt.scatter(x, y)
    plt.savefig("plot" + column + ".png")
    plt.clf()
Because I have a lot of data points, I want to bin the data for each machine and average within each bin, so that I end up with one average y point per bin.
x is between 0 and 1 and I want to bin every 0.05, which gives 20 bins.
I got a histogram for each machine by doing:
for x in ratio.columns:
    ratio.hist(column=x, bins=20)
but this only gives the number of events vs ratio.
How can I link the temperature dataframe? I am new to pandas and I can't figure out how to do this.
Build a mask that bins every 20 rows:
mask = my_df.index // 20
then use groupby and agg:
my_df.groupby(mask).agg(['mean'])
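For the setup in the question (20 bins of width 0.05 on x), a possible sketch is to cut each ratio column with pd.cut and average the matching temperature values per bin; the ratio and temperature names come from the question, everything else (edges, file names) is an assumption:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

bin_edges = np.linspace(0, 1, 21)               # 20 bins between 0 and 1

for column in ratio.columns:
    x = ratio[column]
    y = temperature[column]
    common_ind = x.index.intersection(y.index)  # align the two series
    x, y = x[common_ind], y[common_ind]
    bins = pd.cut(x, bin_edges)                 # label each point with its bin
    y_avg = y.groupby(bins, observed=False).mean()  # average temperature per bin
    x_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    plt.scatter(x_centers, y_avg.values)
    plt.savefig("binned_plot_" + column + ".png")
    plt.clf()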
Related
Having some problem to understand the x_bin in regplot of Seaborn
I used seaborn.regplot to plot data, but I do not quite understand how the error bar in regplot is calculated. I compared the results with the mean and standard deviation derived from a manual calculation. Here is my testing script.

import numpy as np
import pandas as pd
import seaborn as sns

seed = 0  # fixed seed for regplot's bootstrap (assumed; any value works)

def get_data_XYE(p):
    x_list = []
    lower_list = []
    upper_list = []
    for line in p.lines:
        x_list.append(line.get_xdata()[0])
        lower_list.append(line.get_ydata()[0])
        upper_list.append(line.get_ydata()[1])
    y = 0.5 * (np.asarray(lower_list) + np.asarray(upper_list))
    y_error = np.asarray(upper_list) - y
    x = np.asarray(x_list)
    return x, y, y_error

x = [37.3448,36.6026,42.7795,34.7072,75.4027,226.2615,192.7984,140.8045,242.9952,458.451,640.6542,726.1024,231.7347,107.5605,200.2254,190.0006,314.1349,146.8131,152.4497,175.9096,284.9926,116.9681,118.2953,312.3787,815.8389,458.0146,409.5797,595.5373,188.9955,15.7716,36.1839,244.8689,57.4579,94.8717,112.2237,87.0687,72.79,22.3457,24.1728,29.505,80.8765,252.7454,280.6002,252.9573,348.246,112.705,98.7545,317.0541,300.9573,402.8411,406.6884,56.1286,30.1385,32.9909,497.556,19.3606,20.8409,95.2324,108.6074,15.7753,54.5511,45.5623,64.564,101.1934,81.8459,88.286,58.2642,56.1225,51.2943,38.0649,63.5882,63.6847,120.495,102.4097,49.3255,111.3309,171.6028,58.9526,28.7698,144.6884,180.0661,116.6028,146.2594,199.8702,128.9378,423.2363,119.8537,124.6508,518.8625,306.3023,79.5213,121.0309,116.9346,170.8863,930.361,48.9983,55.039,47.1092,72.0548,75.4045,103.521,83.4134,142.3253,146.6215,121.4467,101.4252,68.4812,291.4275,143.9475,142.647,78.9826,47.094,204.2196,89.0208,82.792,27.1346,142.4764,83.7874,67.3216,112.9531,138.2549,133.3446,86.2659,45.3464,56.1604,43.5882,54.3623,86.296,115.7272,96.5498,111.8081,36.1756,40.2947,34.2532,89.1452,53.9062,36.458,113.9297,176.9962,77.3125,77.8891,64.807,64.1515,127.7242,119.6876,976.2324,322.8454,434.2883,168.6923,250.0284,234.7329,131.0793,152.335,118.8838,243.1772,24.1776,168.6327,170.7541,167.8444,75.9315,110.1045,113.4417,60.5464,66.8956,79.7606,71.6659,72.5251,77.513,207.8019,21.8592,35.2787,169.7698,146.5012,412.9934,248.0708,318.5489,104.1278,184.7592,108.0581,175.2646,169.7698,340.3732,570.3396,23.9853,69.0405,66.7391,67.9435,294.6085,68.0537,77.6344,433.2713,104.3178,229.4615,187.8587,78.1399,121.4737,122.5451,384.5935,38.5232,117.6835,50.3308,318.2513,103.6695,20.7181,321.9601,510.3248,13.4754,16.1188,44.8082,37.7291,733.4587,446.6241,21.1822,287.9603,327.2367,274.1109,195.4713,158.2114,64.4537,26.9857,172.8503]
y = [37,40,30,29,24,23,27,12,21,20,29,28,27,32,23,29,28,22,28,23,24,29,32,18,22,12,12,14,29,31,34,31,22,40,25,36,27,27,29,35,33,25,25,27,27,19,35,26,18,24,25,37,52,47,34,39,40,48,41,44,35,36,53,46,38,44,23,26,26,28,27,21,25,21,20,27,35,24,46,34,22,30,30,30,31,26,25,28,21,31,24,27,33,21,31,33,29,33,32,21,25,22,39,31,34,26,23,18,20,18,34,25,20,12,23,25,21,21,25,31,17,27,28,29,25,24,25,21,24,27,23,22,23,22,22,26,22,19,26,35,33,35,29,26,26,30,22,32,33,33,28,32,26,29,36,37,37,28,24,30,25,20,29,24,33,35,30,32,31,33,40,35,37,24,34,29,27,24,36,26,26,26,27,27,20,17,28,34,18,20,20,18,19,23,20,22,25,32,44,41,39,41,40,44,36,42,31,32,26,29,23,29,29,28,31,22,29,24,28,28,25]
xbreaks = [13.4754, 27.1346, 43.5882, 58.9526, 72.79, 89.1452, 110.1045, 131.0793, 158.2114, 180.0661, 207.8019, 234.7329, 252.9573, 300.9573, 327.2367, 348.246, 412.9934, 434.2883, 458.451, 518.8625, 595.5373, 640.6542, 733.4587, 815.8389, 930.361, 976.2324]

df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']

# Check the bin average and std using agg
bins = pd.cut(df.x, xbreaks, right=False)
t = df[['x', 'y']].groupby(bins).agg({"x": "mean", "y": ["mean", "std"]})
t.reset_index(inplace=True)
t.columns = ['range_cut', 'x_avg_cut', 'y_avg_cut', 'y_std_cut']
t.index.name = 'id'

# Get the bin average from regplot
g = sns.regplot(x='x', y='y', data=df, fit_reg=False, x_bins=xbreaks, seed=seed)
xye = pd.DataFrame(get_data_XYE(g)).T
xye.columns = ['x_regplot', 'y_regplot', 'e_regplot']
xye.index.name = 'id'
t2 = xye.merge(t, on='id', how='left')
t2

You can see that the y and e values from the two approaches are different. I understand that the default x_ci or x_estimator may affect the result of regplot, but I still cannot reproduce these values in Excel by removing some of the lowest and/or highest values in each bin.
In seaborn.regplot, the x_bins values are the centers of the bins, and each original x value is assigned to the nearest bin center, whereas in pandas.cut the breaks define the bin edges.
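A small sketch of that difference, reusing df and xbreaks from the question (the intermediate names below are made up for illustration):

import numpy as np
import pandas as pd

# Treat xbreaks as bin centers: each x goes to the nearest break value.
centers = np.asarray(xbreaks)
nearest = centers[np.abs(df.x.values[:, None] - centers[None, :]).argmin(axis=1)]
mean_by_center = df.y.groupby(pd.Series(nearest, index=df.index)).mean()

# Treat xbreaks as bin edges: each x goes to the interval it falls into.
mean_by_edge = df.y.groupby(pd.cut(df.x, xbreaks, right=False)).mean()

# The two groupings put different points together, so the bin means differ.
print(mean_by_center.head())
print(mean_by_edge.head())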
Bin average as a function of position
I want to efficiently calculate the average of a variable (say temperature) over multiple areas of the plane. I essentially want to do the following.

import numpy as np
num = 10000
XYT = np.random.uniform(0, 1, (num, 3))
X = np.transpose(XYT)[0]
Y = np.transpose(XYT)[1]
T = np.transpose(XYT)[2]
size = 10
bins = np.empty((size, size))
for i in range(size):
    for j in range(size):
        # pseudocode: if rescaled X, Y fall in bin[i][j]:
        #     bins[i][j] = mean of T over those points
I would use pandas (although I'm sure you can achieve basically the same with vanilla numpy):

import pandas

df = pandas.DataFrame({'x': npX, 'y': npY, 'temp': npZ})
# solve quadrants
df['quadrant'] = (df['x'] >= 0) * 2 + (df['y'] >= 0) * 1
# group by and aggregate
mean_per_quadrant = df.groupby(['quadrant'])['temp'].aggregate(['mean'])

You may need to create multiple quadrant cutoffs to get unique groupings, for example

(df['x'] >= 50) * 4 + (df['x'] >= 0) * 2 + (df['y'] >= 0) * 1

would add an extra 2 quadrants to the grouping (one for y >= 0 and one for y < 0). Just make sure you use powers of 2.
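For the full size x size grid from the question, one possible sketch (assuming the same XYT array, with X and Y already in [0, 1]) is to cut both coordinates and group on the pair of bin labels:

import numpy as np
import pandas as pd

num = 10000
XYT = np.random.uniform(0, 1, (num, 3))
size = 10

df = pd.DataFrame(XYT, columns=['x', 'y', 'temp'])
edges = np.linspace(0, 1, size + 1)
df['xbin'] = pd.cut(df['x'], edges, labels=False, include_lowest=True)
df['ybin'] = pd.cut(df['y'], edges, labels=False, include_lowest=True)

# mean temperature per (xbin, ybin) cell, reshaped into a size x size array
grid = df.groupby(['xbin', 'ybin'])['temp'].mean().unstack().to_numpy()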
python: increase performance of finding the best timeshift for a correlation between each X column and y
I have a dataframe X with several columns and a dataframe y with only one column (a series). The rows in X represent timesteps and I want to find the interval I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and then loops over all timesteps and correlates the X column with y. If the R² is better than before I store the timestep. However, with over 300 columns this routine is really taking some time and I need to increase the performance. Is there a nice way to simplify this code? (In the example I used the iris data set, which is of course not a timeseries...)

from sklearn import datasets
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from copy import deepcopy

def get_best_shift(dfX, dfy, ti=60, maxt=1440):
    """
    Determines the best correlation for the last maxt minutes based on a
    timestep of ti minutes. Creates a dataframe with the shifted variables
    based on the best match (strongest correlation).
    """
    df_out = deepcopy(dfX)
    for xcol in dfX:
        bestshift = 0
        Rmax = 0
        for ishift in range(0, int(maxt / ti)):
            xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
            yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
            selector = np.array([str(val) != "nan" for val in (xvals * yvals)], dtype=bool)
            xvals = xvals[selector]
            yvals = yvals[selector]
            R = np.corrcoef(xvals, yvals)[0][1]
            # plt.figure()
            # plt.plot(xvals, yvals, 'k.')
            # plt.show()
            if R ** 2 > Rmax:
                Rmax = R ** 2
                # print(Rmax)
                bestshift = ishift
        df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
        df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
    return df_out

iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X, y)
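One possible way to cut down the Python-level looping, sketched under the assumption that dfy's single column can be treated as a Series, is to let pandas correlate every column of X with the shifted target in one call via DataFrame.corrwith; the function below is only an illustration of that idea and returns the best shift and R² per column rather than the shifted dataframe:

import pandas as pd

def get_best_shift_sketch(dfX, dfy, ti=60, maxt=1440):
    """Sketch: for each candidate shift, correlate all columns of dfX
    with the shifted target at once, then keep the strongest R² per column."""
    target = dfy.iloc[:, 0]
    best_shift = pd.Series(0, index=dfX.columns)
    best_r2 = pd.Series(0.0, index=dfX.columns)
    for ishift in range(int(maxt / ti)):
        # shift the target back by ishift rows so row t of X pairs with y[t + ishift]
        r2 = dfX.corrwith(target.shift(-ishift)) ** 2
        improved = r2 > best_r2
        best_r2[improved] = r2[improved]
        best_shift[improved] = ishift
    return best_shift, best_r2

The shifted columns themselves could then be rebuilt from best_shift the same way the original loop does.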
Pandas dataframe column cut - add more bins more frequently around the mean
I am categorizing a quantitative variable (e.g. price) and I would like to bin it so that the bins are much more frequent around the mean and sparser away from the mean. I have seen that cut() can bin in a linear manner and, thanks to numpy.logspace, in a logarithmic manner, but there seems to be nothing built in for binning around the mean, and my ideas so far haven't worked and seem inefficient.
You can make bins that increase in size linearly:

import numpy as np

def make_progressive_bins(min_x, max_x, mean_x, num_bins=10):
    # half-range needed so the bins cover both min_x and max_x
    x_rel_lim = max(mean_x - min_x, max_x - mean_x)
    num_bins_half = num_bins // 2
    bins_right = np.arange(0, num_bins_half + 1)
    if num_bins % 2 == 1:
        bins_right = bins_right + 0.5
    bins_right = np.cumsum(bins_right)
    bins = np.concatenate([-bins_right[bins_right > 0][::-1], bins_right])
    bins = bins * (float(x_rel_lim) / bins[-1]) + mean_x
    return bins

And then you can use it like:

import numpy as np
import matplotlib.pyplot as plt

bins = make_progressive_bins(-20, 50, 10, 15)
plt.bar(bins - 0.1, np.ones_like(bins), 0.2)
I made a script that might do what you want to achieve, but I'm not sure how to convert the resulting cut object into a histogram to see if it does what I want it to do, so please check and tell me if it works :).

import numpy as np
import pandas as pd

# Make normally distributed price with mean 50.
df = pd.DataFrame(data=np.random.normal(50, size=1000), columns=['price'])
df.hist(bins=30)

num_bins = 100
# I used a square function to distribute the bins more around 0 and
# less at the outskirts of the range.
shape_func = lambda x: x**2
bin_loc = [shape_func(i) for i in range(num_bins // 2)]
mirrored_bin_loc = [-x for x in bin_loc[::-1]]
bin_loc = mirrored_bin_loc + bin_loc[1:]

# Rescale and translate bins
data_mean = df.price.mean()
data_range = df.price.max() - df.price.min()
final_bin_loc = [(x + data_mean) / (data_range * num_bins) for x in bin_loc]
# display(final_bin_loc)

binned = pd.cut(df.price, bin_loc)
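One quick way to inspect the result of pd.cut (a small check continuing from the variables above, not part of the original script) is to count how many prices land in each interval and plot that:

# Count observations per interval and plot them as a bar chart.
counts = binned.value_counts().sort_index()
counts.plot(kind='bar')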
Referencing Data From a 2D Histogram
I have the following code that reads data from a CSV file and creates a 2D histogram:

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

# Read in CSV data
filename = 'Complete_Storms_All_US_Only.csv'
df = pd.read_csv(filename)
min_85 = df.min85
min_37 = df.min37
verification = df.one_min_15

# Numbers
x = min_85
y = min_37
H = verification

# Estimate the 2D histogram
nbins = 33
H, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# Rotate and flip H
H = np.rot90(H)
H = np.flipud(H)

# Mask zeros
Hmasked = np.ma.masked_where(H == 0, H)

# Calculate averages
avgarr = np.zeros((nbins, nbins))
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])
for xb, yb, v in zip(xbins, ybins, verification):
    avgarr[yb, xb] += v
divisor = H.copy()
divisor[divisor == 0.0] = np.nan
avgarr /= divisor
binavg = np.around((avgarr * 100), decimals=1)
binper = np.ma.array(binavg, mask=np.isnan(binavg))

# Plot 2D histogram using pcolor
fig1 = plt.figure()
plt.pcolormesh(xedges, yedges, binper)
plt.title('1 minute at +/- 0.15 degrees')
plt.xlabel('min 85 GHz PCT (K)')
plt.ylabel('min 37 GHz PCT (K)')
cbar = plt.colorbar()
cbar.ax.set_ylabel('Probability of CG Lightning (%)')
plt.show()

Each pixel in the histogram contains the probability of lightning for a given range of temperatures at two different frequencies on the x and y axes (min_85 on the x axis and min_37 on the y axis). I am trying to reference the probability of lightning from the histogram based on a wide range of temperatures that vary on an individual basis for any given storm. Each storm has a min_85 and min_37 that corresponds to a probability from the 2D histogram. I know there is a brute-force method where you can create a ridiculous number of if statements, one for each pixel, but this is tedious and inefficient when trying to incorporate multiple 2D histograms. Is there a more efficient way to reference the probability from the histogram based on the given min_85 and min_37? I have a separate file with the min_85 and min_37 data for a large number of storms; I just need to assign the corresponding probability of lightning from the histogram to each one.
It sounds like all you need to do is turn the min_85 and min_37 values into indices. Something like this will work:

# min85data and min37data from your file
dx = xedges[1] - xedges[0]
dy = yedges[1] - yedges[0]
min85inds = np.floor((min85data - xedges[0]) / dx).astype(int)
min37inds = np.floor((min37data - yedges[0]) / dy).astype(int)
# Pretend you didn't do all that flipping of H, or make a copy of it first
hvals = h_orig[min85inds, min37inds]

But do make sure that the resulting indices are valid before you extract them.
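A minimal sketch of that validity check, assuming the same xedges, yedges, un-flipped histogram h_orig, and lookup arrays min85data and min37data as above, is to map each query to a bin with np.digitize and clamp it into range with np.clip:

import numpy as np

# Map each query value to its bin index, then clamp to 0..nbins-1 so
# values outside the histogram edges reuse the nearest edge bin.
min85inds = np.clip(np.digitize(min85data, xedges) - 1, 0, len(xedges) - 2)
min37inds = np.clip(np.digitize(min37data, yedges) - 1, 0, len(yedges) - 2)
hvals = h_orig[min85inds, min37inds]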