I'm trying to solve a system of equations, and I would like to apply fsolve over a pandas DataFrame.
How can I do that?
This is my code:
import numpy as np
import pandas as pd
import scipy.optimize as opt
a = np.linspace(300,400,30)
b = np.random.randint(700,18000,30)
c = np.random.uniform(1.4,4.0,30)
df = pd.DataFrame({'A':a, 'B':b, 'C':c})
def func(zGuess,*Params):
    x,y,z = zGuess
    a,b,c = Params
    eq_1 = ((3.47-np.log10(y))**2+(np.log10(c)+1.22)**2)**0.5
    eq_2 = (a/101.32) * (101.32/b)** z
    eq_3 = 0.381 * x + 0.05 * (b/101.32) -0.15
    return eq_1,eq_2,eq_3
zGuess = np.array([2.6,20.2,0.92])
df['result']= df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])))
But it is still not working, and I can't see the problem.
The error, KeyError: 'A', basically means it can't find the label 'A'.
That's happening because apply works on columns by default (axis=0), so the lambda receives each column as a Series and x['A'] looks for a row label 'A' that doesn't exist.
By passing axis=1 (the trailing 1 in the call below), it will iterate over each row instead, where the column labels 'A', 'B', ... can be looked up.
df['result']= df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])),1)
That, however, might not give the desired result, as it will store the whole output (an array) in a single column.
To split it up, assign to the three columns you want to create and unpack the results with zip(*...):
df['output_a'],df['output_b'],df['output_c'] = zip(*df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])),1) )
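As an alternative to zip(*...), here is a minimal sketch using the result_type="expand" option of DataFrame.apply, which turns each list-like row result into separate columns. The function solve_row is a hypothetical stand-in for the opt.fsolve call (it just does arithmetic so the example runs without SciPy), and the toy data is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]})

def solve_row(row):
    # Hypothetical stand-in for opt.fsolve(func, zGuess, args=(row["A"], row["B"], row["C"]));
    # like fsolve here, it returns three values per row.
    return [row["A"] + row["B"], row["B"] * row["C"], row["C"] - row["A"]]

# axis=1 passes each row to solve_row; "expand" spreads the three values over three columns
outputs = df.apply(solve_row, axis=1, result_type="expand")
outputs.columns = ["output_a", "output_b", "output_c"]
df = df.join(outputs)
print(df)
```

Replacing the body of solve_row with the real fsolve call should give the same three-column layout, since fsolve returns an array of three values for this system.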
I want to run some complex math while aggregating. I wrote the aggregation function:
import math as mt
# convert calc cols to float from object
cols = dfg_dom_smry.columns
cols = cols[2:]
for col in cols:
    df[col] = df[col].astype(float)
# groupby fields
index = ['PA']
#aggregate
df = dfdom.groupby(index).agg({'newcol1': (mt.sqrt(sum('savings'*'rp')**2))/sum('savings')})
I got an error: TypeError: can't multiply sequence by non-int of type 'str'
This is an extract of my data. The full data has many sets of savings and rp columns, so ideally I want to run a for loop over each set of savings and rp columns.
PA  domain              savings  rp
M   M-RET-COM        383,895.36  0.14
P   P-RET-AG      14,302,804.19  0.16
P   P-RET-COM     56,074,119.28  0.33
P   P-RET-IND     46,677,610.00  0.27
P   P-SBD/NC-AG    1,411,905.00  -
P   P-SBD/NC-COM   4,255,891.25  0.36
P   P-SBD/NC-IND     295,365.00  -
S   S-RET-AG       2,391,504.33  0.72
S   S-RET-COM     19,195,073.84  0.18
S   S-RET-IND     17,677,708.38  0.13
S   S-SBD/NC-COM   6,116,407.07  0.05
D   D-RET-COM     11,944,490.39  0.15
D   D-RET-IND      1,213,117.63  -
D   D-SBD/NC-COM   2,708,153.57  0.69
C   C-RET-AG
C   C-RET-COM
C   C-RET-IND
For the above data this would be the final result:
PA newcol1
M 0.143027374757981
P 0.18601700701305
S 0.0979541706738756
D 0.166192684106493
C
thanks for your help
What about
o = dfdom.groupby(index).apply(
    lambda s: pow(pow(s.savings*s.rp, 2).sum(), .5)/(s.savings.sum() or 1)
)
?
Here s stands for the pandas.DataFrame holding the rows of one group, so s.savings and s.rp are that group's columns.
Also, note that o is an instance of pandas.Series, which means that you will have to convert it into a pandas.DataFrame, at least to justify the name you give it, i.e. df. You can do so by doing:
df = o.to_frame('the column name you want')
Put differently/parametrically
def rollup(df, index, svgs, rp, col_name):
    return df.groupby(index).apply(
        lambda s: pow(pow(s[svgs]*s[rp], 2).sum(), .5)/(s[svgs].sum() or 1)
    ).to_frame(col_name)
# rollup(dfdom, index, 'savings', 'rp', 'svgs_rp')
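To cover the "many sets of savings and rp columns" case, the parametric rollup above can be called once per column pair and the results concatenated. The sketch below assumes hypothetical column names (savings1/rp1, savings2/rp2) and toy numbers; it also selects only the two needed columns before apply, which is a small variation on the function above.

```python
import numpy as np
import pandas as pd

def rollup(df, index, svgs, rp, col_name):
    # sqrt(sum((savings * rp)^2)) / sum(savings) per group; "or 1" avoids dividing by zero
    return df.groupby(index)[[svgs, rp]].apply(
        lambda s: np.sqrt(((s[svgs] * s[rp]) ** 2).sum()) / (s[svgs].sum() or 1)
    ).to_frame(col_name)

# Toy data with two hypothetical savings/rp column sets
dfdom = pd.DataFrame({
    "PA": ["M", "P", "P"],
    "savings1": [100.0, 30.0, 40.0], "rp1": [0.5, 1.0, 1.0],
    "savings2": [10.0, 6.0, 8.0],    "rp2": [1.0, 0.5, 0.5],
})

pairs = [("savings1", "rp1", "newcol1"), ("savings2", "rp2", "newcol2")]
# One single-column frame per pair, joined side by side on the PA index
result = pd.concat([rollup(dfdom, ["PA"], s, r, name) for s, r, name in pairs], axis=1)
print(result)
```

Each call returns a one-column frame indexed by PA, so pd.concat with axis=1 lines the new columns up by group.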
update: I replaced the code below with that in the accepted answer.
This is what I finally did. I created a function to step through each of the calculations and call the function for each set of savings and rp cols.
# the function
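def rollup(df, index, svgs, rp):
    df['svgs_rp'] = (df[svgs]*df[rp])**2
    df2 = df.groupby(index).agg({'svgs_rp':'sum',
                                 svgs:'sum'})
    # use NaN (not '') where savings sum to zero so the column stays numeric
    df2['temp'] = np.where((df2[svgs] == 0), np.nan, ((df2['svgs_rp']**(1/2))/df2[svgs]))
    # keep a one-column DataFrame so the column can be renamed afterwards
    df2 = df2[['temp']]
    return df2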
#calling the function
index = ['PA']
# the next part is within a for loop to go through all the savings and rp column sets. For simplicity I've removed the loop.
svgs = 'savings1'
rp = 'rp1'
dftemp = rollup(dfdom, index, svgs, rp)
dftemp.rename({'temp': 'newcol1'}, axis=1, inplace=True)
df = pd.merge(df, dftemp, on=index, how = 'left') # this last step was not put in my original question. I've provided so the r-code below makes sense.
It's annoying that I have to first do the math in new columns and then aggregate. This is the equivalent code in R:
# the function
roll_up <- function(savings,rp){
  return(sqrt(sum((savings*rp)^2,na.rm=T))/sum(savings,na.rm=T))
}
# calling the function
df <- df[dfdom[,.(newcol1=roll_up(savings1,rp1),
newcol2=roll_up(savings2,rp2),...
newcol_n=roll_up(savings_n,rp_n)),by='PA'],on='PA']
I'm relatively new to Python programming, and this is the best I could come up with. If there is a better way to do this, please share. Thanks.
Your groupby should have () and then the [] like this:
df = dfdom.groupby([index]).agg({'newcol1': (mt.sqrt(sum('savings'*'rp')^2))/sum('savings')})
I have a dataframe with numbers, and they are printed out by the print command, so I know they are in the dataframe. But when I apply my equations and my conditional, the values don't seem to be in the variables.
import pandas as pd
data = pd.read_excel('Cam_practice1.xlsx')
df = pd.DataFrame(data, columns = ['x_block', 'y_block'])
print(df)
equation_x = ((df.x_block))**2
equation_y = ((df.y_block))**2
eq = equation_x + equation_y
if eq <=4 :
    df.to_csv('gridoutput.csv')
What I want is this: when the value of the complete formula eq is less than or equal to 4, the row should be written to a new output file. Where am I going wrong?
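eq is a Series with one value per row, so it can't be used directly as the condition of an if statement. Use the comparison as a boolean mask to select the rows instead: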
equation_x = ((df.x_block))**2
equation_y = ((df.y_block))**2
eq = equation_x + equation_y
df[eq<=4].to_csv('gridoutput.csv')
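To see what the mask does, here is a small sketch on made-up data (the numbers are an assumption, standing in for the Excel file): the comparison yields one True/False per row, and indexing with it keeps only the True rows.

```python
import pandas as pd

# Toy stand-in for the spreadsheet data
df = pd.DataFrame({"x_block": [0.0, 1.0, 2.0, 3.0],
                   "y_block": [0.0, 1.0, 1.0, 0.5]})

eq = df["x_block"] ** 2 + df["y_block"] ** 2   # one value per row: 0, 2, 5, 9.25
mask = eq <= 4                                  # boolean Series, one flag per row
inside = df[mask]                               # rows where the condition holds

print(mask.tolist())
print(inside)
```

Calling inside.to_csv('gridoutput.csv') would then write only the selected rows.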
I have a pandas dataframe in the following format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
3,2017-07-15,thing3,55,17
3,2016-05-12,thing3,55,47
4,2012-02-23,thing2,150,22
4,2009-10-10,thing1,25,12
4,2014-04-04,thing2,150,2
5,2008-07-09,thing2,150,43
I have written the following to create two new fields indicating 30 day windows:
import numpy as np
import pandas as pd
start_date_period = pd.period_range('2004-01-01', '12-31-2017', freq='30D')
end_date_period = pd.period_range('2004-01-30', '12-31-2017', freq='30D')
def find_window_start_date(x):
    window_start_date_idx = np.argmax(x < start_date_period.end_time)
    return start_date_period[window_start_date_idx]
df['window_start_dt'] = df['transaction_dt'].apply(find_window_start_date)
def find_window_end_date(x):
    window_end_date_idx = np.argmin(x > end_date_period.start_time)
    return end_date_period[window_end_date_idx]
df['window_end_dt'] = df['transaction_dt'].apply(find_window_end_date)
Unfortunately, this is far too slow doing the row-wise apply for my application. I would greatly appreciate any tips on vectorizing these functions if possible.
EDIT:
The resultant dataframe should have this layout:
'customer_id','transaction_dt','product','price','units','window_start_dt','window_end_dt'
It does not need to be resampled or windowed in the formal sense. It just needs 'window_start_dt' and 'window_end_dt' columns to be added. The current code works; it just needs to be vectorized if possible.
EDIT 2: pandas.cut is built-in:
tt=[[1,'2004-01-02',0.1,25,47],
[1,'2004-01-17',0.2,150,8],
[2,'2004-01-29',0.2,150,25],
[3,'2017-07-15',0.3,55,17],
[3,'2016-05-12',0.3,55,47],
[4,'2012-02-23',0.2,150,22],
[4,'2009-10-10',0.1,25,12],
[4,'2014-04-04',0.2,150,2],
[5,'2008-07-09',0.2,150,43]]
start_date_period = pd.date_range('2004-01-01', '12-01-2017', freq='MS')
end_date_period = pd.date_range('2004-01-30', '12-31-2017', freq='M')
df = pd.DataFrame(tt,columns=['customer_id','transaction_dt','product','price','units'])
df['transaction_dt'] = pd.Series([pd.to_datetime(sub_t[1],format='%Y-%m-%d') for sub_t in tt])
the_cut = pd.cut(df['transaction_dt'],bins=start_date_period,right=True,labels=False,include_lowest=True)
df['win_start_test'] = pd.Series([start_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
df['win_end_test'] = pd.Series([end_date_period[int(x)] if not np.isnan(x) else 0 for x in the_cut])
print(df.head())
win_start_test and win_end_test should be equal to their counterparts computed using your function.
The ValueError was coming from not casting x to int in the relevant line. I also added a NaN check, though it wasn't needed for this toy example.
Note the change to pd.date_range and the use of the start-of-month and end-of-month flags MS and M, as well as the conversion of the date strings into datetime.
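If fixed 30-day windows (rather than calendar months) are what is needed, the window index can also be computed with plain datetime arithmetic. This is only a sketch under stated assumptions: windows are consecutive 30-day blocks counted from 2004-01-01, the end date is taken as the last day inside the block, and the names origin, window and idx are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"transaction_dt": pd.to_datetime(
    ["2004-01-02", "2004-01-17", "2004-01-31", "2004-03-01"])})

origin = pd.Timestamp("2004-01-01")   # assumed start of the first window
window = pd.Timedelta(days=30)        # assumed fixed window length

# Integer number of whole 30-day blocks between origin and each transaction
idx = (df["transaction_dt"] - origin) // window

df["window_start_dt"] = idx * window + origin
df["window_end_dt"] = df["window_start_dt"] + window - pd.Timedelta(days=1)
print(df)
```

Everything here operates on whole columns, so there is no row-wise apply.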
I want to make the for loop below faster in Python.
import pandas as pd
import numpy as np
import math
import scipy.stats
np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in range(len(xl)):
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    # select both columns with a list; tuple-style ['ships_x','ships_y'] fails in current pandas
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')[['ships_x','ships_y']].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    cv = scipy.stats.variation(temptab['contri'])
    # .ix was removed from pandas; .at sets a single cell by label
    p.at[k-1,'cv'] = 1 if math.isnan(cv) else cv
    p.at[k-1,'temp'] = temp
    k = k+1
where,
xl, yl - the two data frames I am working on, with columns such as Concat, ships_x and ships_y.
tempno - an initial list of indices into the xl dataframe, referring to a list of 'Concat' values.
So, in each iteration of the for loop we add one extra index to tempno and then subset the 'yl' dataframe to the rows whose 'Concat' values match those selected from 'xl'. Then we compute the coefficient of variation (taken from the scipy library) and record it in the new dataframe 'p'.
The problem is that it takes too much time, as the number of iterations of the for loop runs into the thousands. The 'groupby' line takes the most time. I have tried a few changes, and the code now looks like the version below, with the changes mentioned in comments. There is a slight improvement, but it doesn't solve my problem. Please suggest the fastest possible way to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')[['ships_x','ships_y']].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
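One possible direction, sketched here rather than offered as a definitive answer: every iteration uses the same base set plus one extra row, so the per-PickDate sums can be computed once per Concat and then combined with array arithmetic instead of repeating the groupby. The function name incremental_cv is hypothetical, and the sketch assumes that Concat values in xl are unique and that ships_y is strictly positive (so a zero total marks a PickDate absent from the subset).

```python
import numpy as np
import pandas as pd

def incremental_cv(xl, yl, base_idx):
    """CV of ships_x/ships_y over PickDate for base_idx plus each single extra row of xl."""
    concats = xl["Concat"].to_numpy()
    # Sum once per (Concat, PickDate); rows follow xl order, missing pairs become 0
    sums = yl.groupby(["Concat", "PickDate"])[["ships_x", "ships_y"]].sum()
    sx = sums["ships_x"].unstack(fill_value=0).reindex(concats, fill_value=0).to_numpy(dtype=float)
    sy = sums["ships_y"].unstack(fill_value=0).reindex(concats, fill_value=0).to_numpy(dtype=float)

    base = np.unique(base_idx)
    in_base = np.isin(np.arange(len(xl)), base)
    # Base totals, plus row ib only when ib is not already part of the base set
    extra = (~in_base)[:, None]
    tot_x = sx[base].sum(axis=0) + sx * extra
    tot_y = sy[base].sum(axis=0) + sy * extra

    with np.errstate(divide="ignore", invalid="ignore"):
        contri = np.where(tot_y > 0, tot_x / tot_y, np.nan)  # NaN marks absent PickDates
        cv = np.nanstd(contri, axis=1) / np.nanmean(contri, axis=1)
    return np.where(np.isnan(cv), 1.0, cv)  # same NaN -> 1 fallback as the loop

# Small demo data shaped like the question's frames
rng = np.random.RandomState(0)
xl = pd.DataFrame({"Concat": np.arange(101, 121)})
yl = pd.DataFrame({"PickDate": rng.randint(1, 4, size=200),
                   "Concat": rng.randint(101, 121, size=200),
                   "ships_x": rng.randint(101, 300, size=200),
                   "ships_y": rng.randint(1001, 3000, size=200)})
base_idx = [2, 5, 7]
fast = incremental_cv(xl, yl, base_idx)
print(fast[:5])
```

The population standard deviation (ddof=0) is used because that is the default of scipy.stats.variation; the result is one value per row of xl, in the same order as the loop.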
I have a DataFrame of force-displacement data. The displacement array has been set to the DataFrame index, and the columns are my various force curves for different tests.
How do I calculate the work done (which is "the area under the curve")?
I looked at numpy.trapz which seems to do what I need, but I think that I can avoid looping over each column like this:
import numpy as np
import pandas as pd
forces = pd.read_csv(...)
work_done = {}
for col in forces.columns:
    work_done[col] = np.trapz(forces.loc[col], forces.index))
I was hoping to create a new DataFrame of the areas under the curves rather than a dict, and thought that DataFrame.apply() or something might be appropriate but don't know where to start looking.
In short:
Can I avoid the looping?
Can I create a DataFrame of work done directly?
Thanks in advance for any help.
You could vectorize this by passing the whole DataFrame to np.trapz and specifying the axis= argument, e.g.:
import numpy as np
import pandas as pd
# some random input data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
forces = pd.DataFrame(x, columns=names)
# vectorized version
wrk = np.trapz(forces, x=forces.index, axis=0)
work_done = pd.DataFrame(wrk[None, :], columns=forces.columns)
# non-vectorized version for comparison
work_done2 = {}
for col in forces.columns:
    work_done2.update({col:np.trapz(forces.loc[:, col], forces.index)})
These give the following output:
from pprint import pprint
pprint(work_done.T)
# 0
# a -24.331560
# b -10.347663
# c 4.662212
# d -12.536040
# e -10.276861
# f 3.406740
# g -3.712674
# h -9.508454
# i -1.044931
# j 15.165782
pprint(work_done2)
# {'a': -24.331559643023006,
# 'b': -10.347663159421426,
# 'c': 4.6622123535050459,
# 'd': -12.536039649161403,
# 'e': -10.276861220217308,
# 'f': 3.4067399176289994,
# 'g': -3.7126739591045541,
# 'h': -9.5084536839888187,
# 'i': -1.0449311137294459,
# 'j': 15.165781517623724}
There are a couple of other problems with your original example. col is a column name rather than a row index, so it needs to index the second dimension of your dataframe (i.e. .loc[:, col] rather than .loc[col]). Also, you have an extra trailing parenthesis on the last line.
Edit:
You could also generate the output DataFrame directly by .applying np.trapz to each column, e.g.:
work_done = forces.apply(np.trapz, axis=0, args=(forces.index,))
However, this isn't really 'proper' vectorization - you are still calling np.trapz separately on each column. You can see this by comparing the speed of the .apply version against calling np.trapz directly:
In [1]: %timeit forces.apply(np.trapz, axis=0, args=(forces.index,))
1000 loops, best of 3: 582 µs per loop
In [2]: %timeit np.trapz(forces, x=forces.index, axis=0)
The slowest run took 6.04 times longer than the fastest. This could mean that an
intermediate result is being cached
10000 loops, best of 3: 53.4 µs per loop
This isn't an entirely fair comparison, since the second version excludes the extra time taken to construct the DataFrame from the output numpy array, but this should still be smaller than the difference in time taken to perform the actual integration.
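For a result that is labelled by column name, the vectorized call can be wrapped in a pandas.Series. The sketch below uses made-up linear force curves so the areas are easy to check by hand; the trapezoid alias is there because NumPy 2.0 renamed np.trapz to np.trapezoid, and the fallback is an assumption about which version is installed.

```python
import numpy as np
import pandas as pd

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever is available
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Toy data: displacement as the index, two linear force curves as columns
x = np.linspace(0.0, 2.0, 5)
forces = pd.DataFrame({"a": x, "b": 2 * x}, index=x)

# Integrate every column at once along the rows (axis=0)
areas = trapezoid(forces.to_numpy(), x=forces.index.to_numpy(), axis=0)
work_done = pd.Series(areas, index=forces.columns, name="work_done")
print(work_done)
```

The trapezoidal rule is exact for straight lines, so the two areas come out as 2 and 4.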
Here's how to get the cumulative integral along a dataframe column using the trapezoidal rule. Alternatively, the following creates a pandas.Series method for your choice of the trapezoidal, Simpson's or Romberg's rule:
import pandas as pd
from scipy import integrate
import numpy as np
#%% Setup Functions
def integrate_method(self, how='trapz', unit='s'):
    '''Numerically integrate the time series.
    #param how: the method to use (trapz by default)
    #return
    Available methods:
    * trapz - trapezoidal
    * cumtrapz - cumulative trapezoidal
    * simps - Simpson's rule
    * romb - Romberg's rule
    See http://docs.scipy.org/doc/scipy/reference/integrate.html for the method details.
    or the source code
    https://github.com/scipy/scipy/blob/master/scipy/integrate/quadrature.py
    '''
    # note: newer SciPy releases name these trapezoid, cumulative_trapezoid and simpson
    available_rules = set(['trapz', 'cumtrapz', 'simps', 'romb'])
    if how in available_rules:
        # look the rule up by name on the scipy.integrate module
        rule = getattr(integrate, how)
    else:
        print('Unsupported integration rule: %s' % (how))
        print('Expecting one of these sample-based integration rules: %s' % (str(list(available_rules))))
        raise AttributeError
    # compare strings with ==, not 'is' (identity is not guaranteed for equal strings)
    if how == 'cumtrapz':
        result = rule(self.values)
        # prepend 0 so the cumulative result has the same length as the input
        result = np.insert(result, 0, 0, axis=0)
    else:
        result = rule(self.values)
    return result
pd.Series.integrate = integrate_method
#%% Setup (random) data
gen = np.random.RandomState(0)
x = gen.randn(100, 10)
names = [chr(97 + i) for i in range(10)]
df = pd.DataFrame(x, columns=names)
#%% Cumulative Integral
df_cummulative_integral = df.apply(lambda x: x.integrate('cumtrapz'))
df_integral = df.apply(lambda x: x.integrate('trapz'))
df_do_they_match = df_cummulative_integral.tail(1).round(3) == df_integral.round(3)
if df_do_they_match.all().all():
    print("Trapz produces the last row of cumtrapz")
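The same cumulative result can be sketched with NumPy alone, which makes the relationship to the plain trapezoidal rule explicit. This assumes unit spacing between samples (as in the calls above, which pass only the values); the function name cumulative_trapezoid_frame is hypothetical.

```python
import numpy as np
import pandas as pd

def cumulative_trapezoid_frame(df, initial=0.0):
    """Column-wise cumulative trapezoidal integral, assuming unit sample spacing."""
    y = df.to_numpy(dtype=float)
    steps = (y[1:] + y[:-1]) / 2.0                     # area of each unit-width trapezoid
    first = np.full((1, y.shape[1]), initial)          # leading row so the length matches the input
    out = np.vstack([first, np.cumsum(steps, axis=0)])
    return pd.DataFrame(out, index=df.index, columns=df.columns)

demo = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0], "b": [1.0, 1.0, 1.0, 1.0]})
cumulative = cumulative_trapezoid_frame(demo)
print(cumulative)
```

The last row of the result is the full trapezoidal integral of each column, which is the property the check above verifies.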