I have the following dataframe (p1.head(7)):
ColA
0 6.286333
1 3.317000
2 13.24889
3 26.20667
4 26.25556
5 60.59000
6 79.59000
7 1.361111
I can get the bin ranges using:
pandas.qcut(p1.ColA, 4)
Is there a way I can create a new column where each value corresponds to the mean value of its bin? I.e. for each bin (a, b], I want (a+b)/2.
The key here is the retbins option on qcut.
import pandas
import numpy as np
df = pandas.DataFrame(np.random.random(100)*100, columns=['val1'])
pctiles = pandas.qcut(df['val1'],4,retbins=True)
pctile_object = pctiles[0]
pctile_boundaries = pctiles[1]
Here pctile_object is just what qcut would return if you hadn't passed retbins=True, and pctile_boundaries is a numpy array of the interval boundaries.
import numpy
bin_halfway = pctile_boundaries[:-1] + (numpy.diff(pctile_boundaries)/2)
This gives us the halfway points of the bins.
Now we make a dataframe with just the interval labels and the halfway points.
df2 = pandas.DataFrame({'quartile boundaries': pctile_object.cat.categories,
                        'midway point': bin_halfway})
Finally, merge the bin halfway points back into the original dataframe.
df['quartile boundaries'] = pctile_object
pandas.merge(df,df2,on='quartile boundaries')
Then you can drop quartile boundaries if you want.
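For example, a minimal sketch assuming the df and df2 built above (merged is just a name chosen here):
# merge the midpoints in, then drop the interval column
merged = pandas.merge(df, df2, on='quartile boundaries')
merged = merged.drop(columns=['quartile boundaries'])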
I wrote a function based on @exp1orer's logic:
import pandas as pd
import numpy as np

def midway_quantiles(feature_series, q=4):
    pctiles = pd.qcut(feature_series, q, retbins=True)
    pctile_object = pctiles[0]
    df1 = pd.DataFrame({"feature": feature_series, "q_bound": pctile_object})
    pctile_boundaries = pctiles[1]
    bin_halfway = pctile_boundaries[:-1] + (np.diff(pctile_boundaries) / 2)
    df2 = pd.DataFrame({"q_bound": pctile_object.cat.categories,
                        "midpoint": bin_halfway})
    df3 = pd.merge(df1, df2, on="q_bound", how="left")
    return df3["midpoint"]
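A usage sketch with the asker's column (this assumes the p1 dataframe from the question; pd.merge with how="left" keeps the row order but returns a fresh index, so .values is used to assign positionally):
p1['ColA_mid'] = midway_quantiles(p1['ColA'], q=4).values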
Related
I have the following dataframe containing floats as input and would like to compute how many values are in the ranges 0-90 and 90-180. The output dataframe was obtained using the FREQUENCY() function in Excel.
[Input dataframe]
[Desired output]
I'd like to do the same thing with Python but didn't find a solution. Do you have any suggestions?
I can also provide source files if needed.
Here's one way, by dividing the columns by 90, then using groupby and count:
import numpy as np
import pandas as pd
data = [
[87.084,5.293],
[55.695,0.985],
[157.504,2.995],
[97.701,179.593],
[97.67,170.386],
[118.713,177.53],
[99.972,176.665],
[124.849,1.633],
[72.787,179.459]
]
df = pd.DataFrame(data,columns=['Var1','Var2'])
df = (df / 90).astype(int)
df1 = pd.DataFrame([["0-90"], ["90-180"]])
df1['Var1'] = df.groupby('Var1').count()
df1['Var2'] = df.groupby('Var2').count()
print(df1)
Output:
0 Var1 Var2
0 0-90 3 4
1 90-180 6 5
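An alternative sketch: bin the raw values with pandas.cut and count per column (this reuses the data list from above; note that a value of exactly 90 would fall into the (0, 90] bin here, whereas the integer division above puts it in 90-180):
raw = pd.DataFrame(data, columns=['Var1', 'Var2'])
bins = [0, 90, 180]
counts = raw.apply(lambda col: pd.cut(col, bins).value_counts(sort=False))
print(counts)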
I have a dataframe with one column "Numbers" and I want to add a second column "Result". The values should be the sum of the previous two values in the "Numbers" column, otherwise NaN.
import pandas as pd
import numpy as np
data = {
"Numbers": [100,200,400,0]
}
df = pd.DataFrame(data,index = ["whatever1", "whatever2", "whatever3", "whatever4"])
def add_prev_two_elems_to_DF(df):
    numbers = "Numbers"  # alias
    result = "Result"  # alias
    df[result] = np.nan  # empty column
    result_index = list(df.columns).index(result)
    for i in range(len(df)):
        #row = df.iloc[i]
        if i < 2: df.iloc[i,result_index] = np.nan
        else: df.iloc[i,result_index] = df.iloc[i-1][numbers] + df.iloc[i-2][numbers]
add_prev_two_elems_to_DF(df)
display(df)
The output is:
Numbers Result
whatever1 100 NaN
whatever2 200 NaN
whatever3 400 300.0
whatever4 0 600.0
But this looks quite complicated. Can this be done more easily and maybe faster? I am not looking for a solution with sum(). I want a general solution for any kind of function that can fill a column using values from other rows.
Edit 1: I forgot to import numpy.
Edit 2: I changed one line to this:
if i < 2: df.iloc[i,result_index] = np.nan
Looks like you could use rolling.sum together with shift. Since rolling.sum sums up to and including the current row, we have to shift it down one row, so that each row value matches the sum of the previous 2 rows:
df['Result'] = df['Numbers'].rolling(2).sum().shift()
Output:
Numbers Result
whatever1 100 NaN
whatever2 200 NaN
whatever3 400 300.0
whatever4 0 600.0
This is the shortest code I could develop. It outputs exactly the same table.
import numpy as np
import pandas as pd
#import swifter # apply() gets swifter
data = {
"Numbers": [100,200,400,0]
}
df = pd.DataFrame(data,index = ["whatever1", "whatever2", "whatever3", "whatever4"])
def func(a: pd.Series) -> float:  # we expect 3 elements, but we don't check that
    a.reset_index(inplace=True, drop=True)  # the index now starts with 0, 1, ...
    return a[0] + a[1]  # we use the first two elements, the 3rd is unnecessary

df["Result"] = df["Numbers"].rolling(3).apply(func)
#df["Result"] = df["Numbers"].swifter.rolling(3).apply(func)
display(df)
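A small note on this approach: by default rolling.apply passes each window as a pandas Series; with raw=True it passes a plain NumPy array, which makes the reset_index call unnecessary and is usually faster. A minimal sketch, assuming the df from above:
df["Result"] = df["Numbers"].rolling(3).apply(lambda a: a[0] + a[1], raw=True)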
I want to make this code, which computes means and updates the dataframe, more concise. How can I find the pattern and reduce this code to a few lines?
import pandas as pd
import numpy as np
df = pd.read_csv('Dataset2.csv')
df = df.to_numpy()
for i in range (0,len(df)):
    mean_1 = df[i,1:5].sum() / 4
    mean_2 = (df[i,0:1].sum() + df[i,2:5].sum()) / 4
    mean_3 = (df[i,0:2].sum() + df[i,3:5].sum()) / 4
    mean_4 = (df[i,0:3].sum() + df[i,4:5].sum()) / 4
    mean_5 = df[i,0:4].sum() / 4
    df[i,0] = df[i,0] - mean_1
    df[i,1] = df[i,1] - mean_2
    df[i,2] = df[i,2] - mean_3
    df[i,3] = df[i,3] - mean_4
    df[i,4] = df[i,4] - mean_5
My interpretation of what you are trying to do is
given a dataframe df, create a new dataframe where the value of the element in row i, column j, is given by the mean of all values in row i - except the one in column j
If this is correct then the following will be much quicker. It assumes the dataframe only consists of columns necessary for this calculation. If there are extra columns you will need to adjust the solution with indexing
means = (df.sum(axis=1).reshape((len(df),1)) - df)/4
means will be a numpy array, so wrap it in a pandas DataFrame if that's what you need.
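Note that the one-liner above assumes df is the NumPy array from the question (after to_numpy()). The same idea, including the final subtraction, can also be written directly on a DataFrame; a minimal sketch assuming it holds only the five numeric columns:
df = pd.read_csv('Dataset2.csv')
means = df.rsub(df.sum(axis=1), axis=0) / 4  # mean of the other four values in each row
result = df - means                          # matches the loop in the question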
I know this subject has been brought up a few times on Stack Overflow, but I'm still running into an interpolation problem.
I have a complex dataframe of a set of columns, which could look something like this if simplified:
df_new = pd.DataFrame(np.random.randn(5,7), columns=[402.3, 407.2, 412.3, 415.8, 419.9, 423.5, 428.3])
wl = np.array([400.0, 408.2, 412.5, 417.2, 420.5, 423.3, 425.0])
So what I need to do is to interpolate column-wise, for each row, to the newly assigned column values (wl).
And how do I get a new dataframe whose columns contain ONLY the values present in the wl array?
Use reindex to include wl as new columns (whose values will be filled with NaNs).
Then use interpolate(axis=1) to interpolate across the columns.
Strictly speaking interpolation is only done between known values.
You could, however, use limit_direction='both' to fill NaN edge values in both the forward and backward directions:
>>> df_new.reindex(columns=df_new.columns.union(wl)).interpolate(axis=1, limit_direction='both')
400.0 402.3 407.2 408.2 412.3 412.5 415.8 417.2 419.9 420.5 423.3 423.5 425.0 428.3
0 0.342346 0.342346 1.502418 1.102496 0.702573 0.379089 0.055606 -0.135563 -0.326732 -0.022298 0.282135 0.586569 0.164917 -0.256734
1 -0.220773 -0.220773 -0.567199 -0.789194 -1.011190 -0.485832 0.039526 -0.426771 -0.893069 -0.191818 0.509432 1.210683 0.414023 -0.382636
2 0.078147 0.078147 0.335040 -0.146892 -0.628824 -0.280976 0.066873 -0.881153 -1.829178 -0.960608 -0.092038 0.776532 0.458758 0.140985
3 -0.792214 -0.792214 0.254805 0.027573 -0.199659 -1.173250 -2.146841 -1.421482 -0.696124 -0.073018 0.550088 1.173194 -0.049967 -1.273128
4 -0.485818 -0.485818 0.019046 -1.421351 -2.861747 -1.020571 0.820605 0.097722 -0.625160 -0.782700 -0.940241 -1.097781 -0.809617 -0.521453
Note that Pandas DataFrames store values in a primarily column-based data structure. So computations are generally more efficient when done column-wise, not row-wise. Therefore, it might be better to transpose your dataframe:
df = df_new.T
and then proceed similarly as described above:
df = df.reindex(index=df.index.union(wl))
df = df.interpolate(limit_direction='both')
If you want to extrapolate edge values, you could use scipy.interpolate.interp1d with fill_value='extrapolate':
import numpy as np
import pandas as pd
import scipy.interpolate as interpolate
np.random.seed(2018)
df_new = pd.DataFrame(np.random.randn(5,7), columns=[402.3, 407.2, 412.3, 415.8, 419.9, 423.5, 428.3])
wl = np.array([400.0, 408.2, 412.5, 417.2, 420.5, 423.3, 425.0, 500])
x = df_new.columns
y = df_new.values
newx = x.union(wl)
result = pd.DataFrame(
    interpolate.interp1d(x, y, fill_value='extrapolate')(newx),
    columns=newx)
yields
400.0 402.3 407.2 408.2 412.3 412.5 415.8 417.2 419.9 420.5 423.3 423.5 425.0 428.3 500.0
0 -0.679793 -0.276768 0.581851 0.889017 2.148399 1.952520 -1.279487 -0.671080 0.502277 0.561236 0.836376 0.856029 0.543898 -0.142790 -15.062654
1 0.484717 0.110079 -0.688065 -0.468138 0.433564 0.437944 0.510221 0.279613 -0.165131 -0.362906 -1.285854 -1.351779 -0.758526 0.546631 28.904127
2 1.303039 1.230655 1.076446 0.628001 -1.210625 -1.158971 -0.306677 -0.563028 -1.057419 -0.814173 0.320975 0.402057 0.366778 0.289165 -1.397156
3 2.385057 1.282733 -1.065696 -1.191370 -1.706633 -1.618985 -0.172797 -0.092039 0.063710 0.114863 0.353577 0.370628 -0.246613 -1.604543 -31.108665
4 -3.360837 -2.165729 0.380370 0.251572 -0.276501 -0.293597 -0.575682 -0.235060 0.421854 0.469009 0.689062 0.704780 0.498724 0.045401 -9.804075
If you wish to create a DataFrame containing only the wl columns, you could sub-select those columns using result[wl], or you could simply interpolate only at the wl values:
result_wl = pd.DataFrame(
    interpolate.interp1d(x, y, fill_value='extrapolate')(wl),
    columns=wl)
I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of the same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
from math import sqrt
import random
import numpy as np
from pandas import Series

theor_err = [sqrt(abs(x)) for x in theor_df[str(INTENSITY)]]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
Numpy can handle arrays for you. So, you can do it like this:
import pandas as pd
import numpy as np
a=pd.DataFrame([10,20,15,30],columns=['INTENSITY'])
a['theor_err']=np.sqrt(np.abs(a.INTENSITY))
a['sample']=np.random.uniform(-a['theor_err'],a['theor_err'])
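If you need several such columns, like the question's gen_0 and gen_1, the same vectorised call works in a short loop; a sketch building on a from above, adding the intensity back in as the question's code does:
for nr_sample in range(2):
    a['gen_{}'.format(nr_sample)] = a['INTENSITY'] + np.random.uniform(-a['theor_err'], a['theor_err'])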
Suppose you want to generate 6 samples. You can try the code below; you can tune the number of samples by setting the value of k.
df = pd.DataFrame([[1],[2],[3],[4],[-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i+1) for i in range(k)]
df["err"] = np.sqrt(np.abs((df["intensity"])))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:,sample_names] = df.loc[:,sample_names].add(df.intensity, axis=0)
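As a quick sanity check (assuming the df built above), every generated sample should lie within intensity ± err of its row:
deviation = df[sample_names].sub(df["intensity"], axis=0).abs()
print(deviation.le(df["err"], axis=0).all().all())  # expected: True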