Pandas complex math in groupby+aggregation - python

I want to run some complex math while aggregating. I wrote the aggregation function:
import math as mt
# convert calc cols to float from object
cols = dfg_dom_smry.columns
cols = cols[2:]
for col in cols:
    df[col] = df[col].astype(float)
# groupby fields
index = ['PA']
# aggregate
df = dfdom.groupby(index).agg({'newcol1': (mt.sqrt(sum('savings'*'rp')**2))/sum('savings')})
I got an error: TypeError: can't multiply sequence by non-int of type 'str'
This is an extract of my data. The full data has many sets of savings and rp columns, so ideally I want to run a for loop over each set of savings and rp columns.
PA domain savings rp
M M-RET-COM 383,895.36 0.14
P P-RET-AG 14,302,804.19 0.16
P P-RET-COM 56,074,119.28 0.33
P P-RET-IND 46,677,610.00 0.27
P P-SBD/NC-AG 1,411,905.00 -
P P-SBD/NC-COM 4,255,891.25 0.36
P P-SBD/NC-IND 295,365.00 -
S S-RET-AG 2,391,504.33 0.72
S S-RET-COM 19,195,073.84 0.18
S S-RET-IND 17,677,708.38 0.13
S S-SBD/NC-COM 6,116,407.07 0.05
D D-RET-COM 11,944,490.39 0.15
D D-RET-IND 1,213,117.63 -
D D-SBD/NC-COM 2,708,153.57 0.69
C C-RET-AG
C C-RET-COM
C C-RET-IND
For the above data this would be the final result:
PA newcol1
M 0.143027374757981
P 0.18601700701305
S 0.0979541706738756
D 0.166192684106493
C
Thanks for your help.

What about
o = dfdom.groupby(index).apply(
    lambda s: pow(pow(s.savings*s.rp, 2).sum(), .5)/(s.savings.sum() or 1)
)
?
Where s above stands for each group (a pandas.DataFrame), so s.savings and s.rp are instances of pandas.Series.
Also, note that o is an instance of pandas.Series, which means that you will have to convert it into a pandas.DataFrame, at least to justify the name you give it, i.e. df. You can do so by doing:
df = o.to_frame('the column name you want')
Put differently/parametrically:
def rollup(df, index, svgs, rp, col_name):
    return df.groupby(index).apply(
        lambda s: pow(pow(s[svgs]*s[rp], 2).sum(), .5)/(s[svgs].sum() or 1)
    ).to_frame(col_name)
# rollup(dfdom, index, 'savings', 'rp', 'svgs_rp')
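Since your full data has many savings/rp column pairs, here is a minimal sketch of looping over the pairs with the rollup above; the pair names (savings1/rp1, savings2/rp2) are placeholders, so adjust them to your data:
import pandas as pd

# hypothetical column-pair names; replace with your actual savings/rp sets
pairs = [('savings1', 'rp1'), ('savings2', 'rp2')]
frames = [rollup(dfdom, index, svgs, rp, svgs + '_' + rp) for svgs, rp in pairs]
# the per-pair results share the groupby index, so stitch them together side by side
out = pd.concat(frames, axis=1).reset_index()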

Update: I replaced the code below with that in the accepted answer.
This is what I finally did. I created a function to step through each of the calculations and called the function for each set of savings and rp cols.
# the function
def rollup(df, index, svgs, rp):
    df['svgs_rp'] = (df[svgs]*df[rp])**2
    df2 = df.groupby(index).agg({'svgs_rp': 'sum',
                                 svgs: 'sum'})
    df2['temp'] = np.where((df2[svgs] == 0), '', ((df2['svgs_rp']**(1/2))/df2[svgs]))
    df2 = df2['temp']
    return df2
# calling the function
index = ['PA']
# the next part is within a for loop that goes through all the savings and rp column sets. For simplicity I've removed the loop.
svgs = 'savings1'
rp = 'rp1'
dftemp = rollup(dfdom, index, svgs, rp)
dftemp.rename({'temp': newcol1}, axis=1, inplace=True)
df = pd.merge(df, dftemp, on=index, how='left')  # this last step was not in my original question; I've included it so the R code below makes sense.
It's annoying that I have to do the math in new columns first and then aggregate. This is the equivalent code in R:
# the function
roll_up <- function(savings, rp){
    return(sqrt(sum((savings*rp)^2, na.rm=T))/sum(savings, na.rm=T))
}
# calling the function
df <- df[dfdom[,.(newcol1=roll_up(savings1,rp1),
                  newcol2=roll_up(savings2,rp2),...
                  newcol_n=roll_up(savings_n,rp_n)),by='PA'],on='PA']
I'm relatively new to Python programming, and this is the best I could come up with. If there is a better way to do this, please share. Thanks.

Your groupby should have () and then the [] like this:
df = dfdom.groupby([index]).agg({'newcol1': (mt.sqrt(sum('savings'*'rp')^2))/sum('savings')})

Related

How to apply fsolve over pandas dataframe columns?

I'm trying to solve a system of three equations (they appear as eq_1, eq_2 and eq_3 in the code below), and I would like to apply fsolve over a pandas dataframe. How can I do that?
This is my code:
import numpy as np
import pandas as pd
import scipy.optimize as opt

a = np.linspace(300,400,30)
b = np.random.randint(700,18000,30)
c = np.random.uniform(1.4,4.0,30)
df = pd.DataFrame({'A':a, 'B':b, 'C':c})

def func(zGuess, *Params):
    x,y,z = zGuess
    a,b,c = Params
    eq_1 = ((3.47-np.log10(y))**2+(np.log10(c)+1.22)**2)**0.5
    eq_2 = (a/101.32) * (101.32/b)**z
    eq_3 = 0.381 * x + 0.05 * (b/101.32) - 0.15
    return eq_1, eq_2, eq_3

zGuess = np.array([2.6,20.2,0.92])
df['result'] = df.apply(lambda x: opt.fsolve(func, zGuess, args=(x['A'],x['B'],x['C'])))
But it's still not working, and I can't see the problem.
The error: KeyError: 'A'

The error basically means it can't find the reference to 'A'.
That's happening because apply doesn't iterate over rows by default.
By setting the parameter 1 (i.e. axis=1) at the end, it will iterate over each row, looking up the column references 'A', 'B', ...
df['result']= df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])),1)
That, however, might not give the desired result, as it will save all the output (an array) into a single column.
For that, refer to the three columns you want to create and build an iterator with zip(*...):
df['output_a'],df['output_b'],df['output_c'] = zip(*df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])),1) )
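As an aside, on pandas 0.23+ you can also let apply expand the returned tuple into columns via result_type='expand'; a minimal sketch, assuming the same func and zGuess as above:
out = df.apply(lambda x: opt.fsolve(func, zGuess, args=(x['A'], x['B'], x['C'])),
               axis=1, result_type='expand')
# fsolve returns a length-3 array per row, which expands into three columns
df[['output_a', 'output_b', 'output_c']] = out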

Losing variables in Jupyter Notebook

In a Jupyter notebook, I declare one variable from a file:
with fits.open('mind_dataset/matrix_CEREBELLUM_large.fits') as data:
    matrix_cerebellum = pd.DataFrame(data[0].data.byteswap().newbyteorder())
In the cells below, I have two methods:
neuronal_web_pixel = 0.32 # 1 micron => 10e-6 meters

def pixels_to_scale(df, mind=False, cosmos=False):
    one_pixel_equals_micron = neuronal_web_pixel
    brain_mask = (df != 0.0)
    df[brain_mask] *= one_pixel_equals_micron
    return df
and
def binarize_matrix(df, mind=False, cosmos=False):
    brain_Llink = 16.0 # microns
    zero_mask = (df != 0)
    low_mask = (df <= brain_Llink)
    df[low_mask & zero_mask] = 1.0
    higher_mask = (df >= brain_Llink)
    df[higher_mask] = 0.0
    return df
Then I pass my variables to methods, to obtain scaled and binary dataframes:
matrix_cerebellum_scaled = pixels_to_scale(matrix_cerebellum, mind=True)
And:
matrix_cerebellum_binary = binarize_matrix(matrix_cerebellum_scaled, mind=True)
However, if I now inspect matrix_cerebellum_scaled, it holds the same contents as matrix_cerebellum_binary, and I lose the matrix_cerebellum_scaled dataframe.
Why? What am I missing?
A naming thing first: those aren't methods, they're functions. Now: if you modify a DataFrame within a function, those changes still happen to the DataFrame that was passed in. If you want a new DataFrame, declare it as a copy of the one being passed in.
At the very least, at the top of binarize_matrix() do new_df = df.copy() (and operate on new_df from there). More detail about why that's necessary in this SO answer and its comments: https://stackoverflow.com/a/39628860/42346
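For illustration, a sketch of the same binarize_matrix with that one-line fix applied (pixels_to_scale would get the identical treatment):
def binarize_matrix(df, mind=False, cosmos=False):
    new_df = df.copy()  # work on a copy so the caller's DataFrame is untouched
    brain_Llink = 16.0  # microns
    zero_mask = (new_df != 0)
    low_mask = (new_df <= brain_Llink)
    new_df[low_mask & zero_mask] = 1.0
    higher_mask = (new_df >= brain_Llink)
    new_df[higher_mask] = 0.0
    return new_df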

Apply Function under Group == condition

I have a DataFrame as follows:
position_latitude position_longitude geohash
0 53.398940 10.069293 u1
1 53.408875 10.052669 u1
2 48.856350 9.171759 u0
3 48.856068 9.170798 u0
4 48.856350 9.171759 u0
What I want to do now is retrieve the nearest node to these positions, using different shapefiles based on the geohash.
So what I want to do is load, for every group in geohash (e.g. u1), the graph from a file and then use this graph in a function to get the nearest node.
I could do it in a for loop, but I think there are more efficient ways of doing so.
I thought of something like this:
df['nearestNode'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, x.geohash), axis=1)
However, I can't figure out how to load the graph only once per group, since it takes some time to get it from the file.
What I came up with so far:
groupHashed = geoSub.groupby('geohash')
geoSub['distance'] = np.nan
for name, group in groupHashed:
    G = osmnx.graph.graph_from_xml('geohash/'+name+'.osm', simplify=True, retain_all=False)
    geoSub['distance'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G) if x.geohash == name, axis=1)
It definitely seems to work; however, I feel like the if condition slows it down drastically.
Update:
I just changed:
geoSub['distance'] = geoSub.apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G) if x.geohash == name, axis=1)
to:
geoSub['distance'] = geoSub[geoSub['geohash'] == name].apply(lambda x: getDistanceToEdge(x.position_latitude, x.position_longitude, G), axis=1)
It's a lot faster now. Is there an even better method?
You can use transform.
I am stubbing G and getDistanceToEdge (as x+y+geohash[-1]) to show a working example:
import pandas as pd
from io import StringIO

data = StringIO("""
,position_latitude,position_longitude,geohash
0,53.398940,10.069293,u1
1,53.408875,10.052669,u1
2,48.856350,9.171759,u0
3,48.856068,9.170798,u0
4,48.856350,9.171759,u0
""")
df = pd.read_csv(data, index_col=0).fillna('')

def getDistanceToEdge(x, y, G):
    return x+y+G

def fun(pos):
    G = int(pos.values[0][-1][-1])
    return pos.apply(lambda x: getDistanceToEdge(x[0], x[1], G))

df['pos'] = list(zip(df['position_latitude'], df['position_longitude'], df['geohash']))
df['distance'] = df.groupby(['geohash'])['pos'].transform(fun)
df = df.drop(['pos'], axis=1)
print(df)
Output:
position_latitude position_longitude geohash distance
0 53.398940 10.069293 u1 64.468233
1 53.408875 10.052669 u1 64.461544
2 48.856350 9.171759 u0 58.028109
3 48.856068 9.170798 u0 58.026866
4 48.856350 9.171759 u0 58.028109
As you can see, you can get the name of the group using pos.values[0][-1] inside the function fun. This works because we are framing the pos column as a tuple of (lat, lon, geohash), and every geohash within a group after groupby is the same. So within a group we can grab the geohash by taking the last value of the tuple (pos) of any row; pos.values[0][-1] gives the last value of the tuple of the first row.
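Translated back to the original problem, fun would load the graph once per group; an untested sketch, assuming getDistanceToEdge takes (lat, lon, G) as in the question:
import osmnx

def fun(pos):
    name = pos.values[0][-1]  # the geohash shared by the whole group
    G = osmnx.graph.graph_from_xml('geohash/' + name + '.osm', simplify=True, retain_all=False)
    return pos.apply(lambda x: getDistanceToEdge(x[0], x[1], G))

geoSub['pos'] = list(zip(geoSub['position_latitude'], geoSub['position_longitude'], geoSub['geohash']))
geoSub['distance'] = geoSub.groupby('geohash')['pos'].transform(fun)
geoSub = geoSub.drop(['pos'], axis=1)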

Python Pandas rolling mean DataFrame Constructor not properly called

I am trying to create a simple time series of different rolling types. One specific example is a rolling mean of N periods using the pandas Python package.
I get the following error : ValueError: DataFrame constructor not properly called!
Below is my code :
def py_TA_MA(v, n, AscendType):
    df = pd.DataFrame(v, columns=['Close'])
    df = df.sort_index(ascending=AscendType)  # ascending/descending flag
    M = pd.Series(df['Close'].rolling(n), name='MovingAverage_' + str(n))
    df = df.join(M)
    df = df.sort_index(ascending=True)  # need to double-check this
    return df
Would anyone be able to advise?
Kind regards
Found the correction! It was erroring out (a new error) because I had to explicitly declare n as an integer. Below, the code works:
@xw.func
@xw.arg('n', numbers=int, doc='this is the rolling window')
@xw.ret(expand='table')
def py_TA_MA(v, n, AscendType):
    df = pd.DataFrame(v, columns=['Close'])
    df = df.sort_index(ascending=AscendType)  # ascending/descending flag
    M = pd.Series(df['Close'], name='Moving Average').rolling(window=n).mean()
    #df = pd.Series(df['Close']).rolling(window=n).mean()
    df = df.join(M)
    df = df.sort_index(ascending=True)  # need to double-check this
    return df
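A quick usage sketch with made-up prices (this assumes import pandas as pd and import xlwings as xw are in scope; the decorators matter only when the function is called from Excel, and the decorated function remains directly callable from Python):
prices = [10.0, 10.5, 10.2, 10.8, 11.1, 10.9]
result = py_TA_MA(prices, 3, True)  # 3-period rolling mean over an ascending index
print(result)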

Pass a function with parameters specified for resample() method on a pandas dataframe

I want to pass a function to resample() on a pandas dataframe with certain parameters specified when it is passed (as opposed to defining several separate functions).
This is the function:
import itertools
import numpy as np

def spell(X, kind='wet', how='mean', threshold=0.5):
    if kind == 'wet':
        condition = X > threshold
    else:
        condition = X <= threshold
    length = [sum(1 if x == True else np.nan for x in group) for key, group in itertools.groupby(condition)]
    if not length:
        res = 0
    elif how == 'mean':
        res = np.mean(length)
    else:
        res = np.max(length)
    return res
Here is a dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
And here's roughly what I want to do with it:
df.resample('M', how=spell(kind='dry',how='max',threshold=0.7))
But I get the error TypeError: spell() takes at least 1 argument (3 given). I want to be able to pass this function with these parameters specified except for the input array. Is there a way to do this?
EDIT:
X is the input array that is passed to the function when calling the resample method on a dataframe object, like so: df.resample('M', how=my_func), for a monthly resampling interval.
If I try df.resample('M', how=spell) I get:
0
1960-01-31 1.875000
1960-02-29 1.500000
1960-03-31 1.888889
1960-04-30 3.000000
which is exactly what I want for the default parameters, but I want to be able to specify the input parameters to the function before passing it. This might mean storing the definition in another variable, but I'm not sure how to do that with the default parameters changed.
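One standard way to store such a definition with changed defaults is functools.partial (an equivalent lambda works too); a minimal sketch, using the older resample(how=...) signature from this question:
from functools import partial

# pre-fill every parameter except the input array X
dry_max = partial(spell, kind='dry', how='max', threshold=0.7)
df.resample('M', how=dry_max)
# on modern pandas the equivalent would be: df.resample('M').apply(dry_max)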
I think this may be what you're looking for, though it's a little hard to tell. Let me know if this helps. First, the example dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
EDIT: I originally had a greater-than where the less-than-or-equal-to should be; it's fixed below.
Next, the function:
def spell(df, column='', kind='wet', rule='M', how='mean', threshold=0.5):
    if kind == 'wet':
        df = df[df[column] > threshold]
    else:
        df = df[df[column] <= threshold]
    df = df.resample(rule=rule, how=how)
    return df
So, you would call it by:
spell(df, 0)
To get:
0
1960-01-31 0.721519
1960-02-29 0.754054
1960-03-31 0.746341
1960-04-30 0.654872
You can change around the parameters as well:
spell(df, 0, kind='something else', rule='W', how='max', threshold=0.7)
0
1960-01-03 0.570638
1960-01-10 0.529357
1960-01-17 0.565959
1960-01-24 0.682973
1960-01-31 0.676349
1960-02-07 0.379397
1960-02-14 0.680303
1960-02-21 0.654014
1960-02-28 0.546587
1960-03-06 0.699459
1960-03-13 0.626460
1960-03-20 0.611464
1960-03-27 0.685950
1960-04-03 0.688385
1960-04-10 0.697602
